Re: BUG #15641: Autoprewarm worker fails to start on Windows with huge pages in use

2019-03-18 Thread Mithun Cy
On Mon, Feb 25, 2019 at 12:10 AM Mithun Cy  wrote:

> Thanks, Hans, for the simple reproducible test.
>
> The "worker.bgw_restart_time" is never set for autoprewarm workers, so on
> error a worker gets restarted after some period of time (the default
> behavior). Since the database itself was dropped, our attempt to connect to
> that database failed and the worker exited, but it then got restarted by the
> postmaster, and we start seeing the DSM segment error above.
>
> I think every autoprewarm worker should be set up with
> "worker.bgw_restart_time = BGW_NEVER_RESTART;" so that there are no repeated
> prewarm attempts on a dropped database. I will think about it further and
> submit a patch for this.
>

Here is the patch for this.

The autoprewarm worker should not be restarted. As per the code in
@apw_start_database_worker@, the master starts a worker per database and
waits until it exits by calling WaitForBackgroundWorkerShutdown.  That call
cannot handle the case where the worker was restarted:
WaitForBackgroundWorkerShutdown() gets the status BGWH_STOPPED from
GetBackgroundWorkerPid() if the worker was restarted, so the master will
then detach the shared memory, and each subsequently restarted worker keeps
failing, in an unending loop.

I think there is no need to restart at all. The following are the normal
errors we might encounter:
1. The database being connected to is dropped -- we need to skip to the next
database, which the master will do by starting a new worker, so a restart is
not needed.
2. A relation is dropped -- try_relation_open(reloid, AccessShareLock) is
used, so an error due to a dropped relation is handled; this also avoids
concurrent truncation.
3. smgrexists is used before reading from a fork file, so again the error is
handled.
4. Before reading a block we have the check below, so previously truncated
pages will not be read again.
/* Check whether blocknum is valid and within fork file size. */
if (blk->blocknum >= nblocks)

I think if any other unexpected error occurs it should be fatal, and
restarting would not correct it. Hence there is no need to restart the
per-database worker process.
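
For illustration, here is a rough sketch of how a per-database worker can be
registered with restarts disabled (this is not the actual autoprewarm code;
the bgw_function_name and the OID argument shown here are simplifications):

/*
 * Rough sketch only: register a per-database worker with restarts disabled,
 * then wait for it to finish.
 */
#include "postgres.h"
#include "miscadmin.h"
#include "postmaster/bgworker.h"

static void
start_database_worker_sketch(Oid dboid)
{
	BackgroundWorker worker;
	BackgroundWorkerHandle *handle;
	BgwHandleStatus status;

	memset(&worker, 0, sizeof(worker));
	worker.bgw_flags = BGWORKER_SHMEM_ACCESS |
		BGWORKER_BACKEND_DATABASE_CONNECTION;
	worker.bgw_start_time = BgWorkerStart_ConsistentState;
	worker.bgw_restart_time = BGW_NEVER_RESTART;	/* the proposed fix */
	snprintf(worker.bgw_library_name, BGW_MAXLEN, "pg_prewarm");
	snprintf(worker.bgw_function_name, BGW_MAXLEN, "autoprewarm_database_main");
	snprintf(worker.bgw_name, BGW_MAXLEN, "autoprewarm worker");
	worker.bgw_main_arg = ObjectIdGetDatum(dboid);
	worker.bgw_notify_pid = MyProcPid;

	if (!RegisterDynamicBackgroundWorker(&worker, &handle))
	{
		ereport(WARNING,
				(errmsg("could not register autoprewarm worker")));
		return;
	}

	/*
	 * With BGW_NEVER_RESTART the BGWH_STOPPED result below reliably means
	 * "worker is gone", so the master can move on to the next database.
	 */
	status = WaitForBackgroundWorkerShutdown(handle);
	(void) status;
}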

I tried to dig into why we did not set it earlier. It used to be
never-restart, but that changed while addressing review comments [1]. At
that time we did not make an explicit database connection per worker and did
not handle as many error cases as we do now, so it appeared fair. But when
the code changed to make a database connection per worker, we should have
set every worker to BGW_NEVER_RESTART; not doing so was, I think, a mistake.

NOTE : On zero exit status we will not restart the bgworker (see
@CleanupBackgroundWorker@
and @maybe_start_bgworkers@)
[1]
https://www.postgresql.org/message-id/CA%2BTgmoYNF_wfdwQ3z3713zKy2j0Z9C32WJdtKjvRWzeY7JOL4g%40mail.gmail.com
-- 
Thanks and Regards
Mithun Chicklore Yogendra
EnterpriseDB: http://www.enterprisedb.com


never_restart_apw_worker_01.patch
Description: Binary data


Re: Online verification of checksums

2019-03-18 Thread Michael Banck
Hi,

On Monday, 18.03.2019 at 02:38 -0400, Stephen Frost wrote:
> * Michael Paquier (mich...@paquier.xyz) wrote:
> > On Mon, Mar 18, 2019 at 01:43:08AM -0400, Stephen Frost wrote:
> > > To be clear, I agree completely that we don't want to be reporting false
> > > positives or "this might mean corruption!" to users running the tool,
> > > but I haven't seen a good explanation of why this needs to involve the
> > > server to avoid that happening.  If someone would like to point that out
> > > to me, I'd be happy to go read about it and try to understand.
> > 
> > The mentions on this thread that the server has all the facility in
> > place to properly lock a buffer and make sure that a partial read
> > *never* happens and that we *never* have any kind of false positives,
> 
> Uh, we are, of course, going to have partial reads- we just need to
> handle them appropriately, and that's not hard to do in a way that we
> never have false positives.

I think the current patch (V13 from https://www.postgresql.org/message-i
d/1552045881.4947.43.ca...@credativ.de) does that, modulo possible bugs.

> I do not understand, at all, the whole sub-thread argument that we have
> to avoid partial reads.  We certainly don't worry about that when doing
> backups, and I don't see why we need to avoid it here.  We are going to
> have partial reads- and that's ok, as long as it's because we're at the
> end of the file, and that's easy enough to check by just doing another
> read to see if we get back zero bytes, which indicates we're at the end
> of the file, and then we move on, no need to coordinate anything with
> the backend for this.

Well, I agree with you, but we don't seem to have consensus on that.

> > directly preventing the set of issues we are trying to implement
> > workarounds for in a frontend tool are rather good arguments in my
> > opinion (you can grep for BufferDescriptorGetIOLock() on this thread 
> > for example).
> 
> Sure the backend has those facilities since it needs to, but these
> frontend tools *don't* need that to *never* have any false positives, so
> why are we complicating things by saying that this frontend tool and the
> backend have to coordinate?
> 
> If there's an explanation of why we can't avoid having false positives
> in the frontend tool, I've yet to see it.  I definitely understand that
> we can get partial reads, but a partial read isn't a failure, and
> shouldn't be reported as such.

It is not in the current patch; it should just get reported as a skipped
block at the end -- if the cluster is online, that is.  If it is offline,
we do consider it a failure.

I have now rebased that patch on top of the pg_verify_checksums ->
pg_checksums renaming, see attached.


Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.ba...@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz

diff --git a/doc/src/sgml/ref/pg_checksums.sgml b/doc/src/sgml/ref/pg_checksums.sgml
index 6a47dda683..124475f057 100644
--- a/doc/src/sgml/ref/pg_checksums.sgml
+++ b/doc/src/sgml/ref/pg_checksums.sgml
@@ -37,9 +37,8 @@ PostgreSQL documentation
   Description
   
pg_checksums verifies data checksums in a
-   PostgreSQL cluster.  The server must be shut
-   down cleanly before running pg_checksums.
-   The exit status is zero if there are no checksum errors, otherwise nonzero.
+   PostgreSQL cluster.  The exit status is zero if
+   there are no checksum errors, otherwise nonzero.  
   
  
 
diff --git a/src/bin/pg_checksums/pg_checksums.c b/src/bin/pg_checksums/pg_checksums.c
index b7ebc11017..0ed065f7e9 100644
--- a/src/bin/pg_checksums/pg_checksums.c
+++ b/src/bin/pg_checksums/pg_checksums.c
@@ -1,7 +1,7 @@
 /*-
  *
  * pg_checksums.c
- *	  Verifies page level checksums in an offline cluster.
+ *	  Verifies page level checksums in a cluster.
  *
  * Copyright (c) 2010-2019, PostgreSQL Global Development Group
  *
@@ -28,12 +28,16 @@
 
 
 static int64 files = 0;
+static int64 skippedfiles = 0;
 static int64 blocks = 0;
 static int64 badblocks = 0;
+static int64 skippedblocks = 0;
 static ControlFileData *ControlFile;
+static XLogRecPtr checkpointLSN;
 
 static char *only_relfilenode = NULL;
 static bool verbose = false;
+static bool online = false;
 
 static const char *progname;
 
@@ -90,10 +94,17 @@ scan_file(const char *fn, BlockNumber segmentno)
 	PageHeader	header = (PageHeader) buf.data;
 	int			f;
 	BlockNumber blockno;
+	bool		block_retry = false;
 
 	f = open(fn, O_RDONLY | PG_BINARY, 0);
 	if (f < 0)
 	{
+		if (online && errno == ENOENT)
+		{
+			/* File was removed in the meantime */
+			return;
+		}
+
 		fprintf(stderr, _

RE: SQL statement PREPARE does not work in ECPG

2019-03-18 Thread Matsumura, Ryo
Meskes-san

Thank you for your comment.

> One question though, why is the statement name always quoted? Do we
> really need that? Seems to be more of a hassle than an advantage.

The following can be accepted by preproc, ecpglib, libpq, and backend in 
previous versions.
  exec sql prepare "st x" from "select 1";
  exec sql execute "st x";

The above was preprocessed to the following.
  PQprepare(conn, "\"st x\"", "select 1");
  PQexec(conn, "\"st x\"");

By the way, the following also can be accepted.
  PQexecParams(conn, "prepare \"st x\" ( int ) as select $1", 0, NULL, NULL, 
NULL, NULL, 0);
  PQexecParams(conn, "execute \"st x\"( 1 )", 0, NULL, NULL, NULL, NULL, 0);

Therefore, I think that quoting the statement name is needed in the
PREPARE/EXECUTE case, too.

> I would prefer to merge as much as possible, as I am afraid that if we
> do not merge the approaches, we will run into issues later. There was a
> reason why we added PQprepare, but I do not remember it from the top of
> my head. Need to check when I'm online again.

I will also consider it.

Regards
Ryo Matsumura


Re: pg_basebackup ignores the existing data directory permissions

2019-03-18 Thread Magnus Hagander
On Mon, Mar 18, 2019 at 7:08 AM Stephen Frost  wrote:

> Greetings,
>
> * Robert Haas (robertmh...@gmail.com) wrote:
> > On Thu, Mar 14, 2019 at 7:34 PM Peter Eisentraut
> >  wrote:
> > > I think the potential problems of getting this wrong are bigger than
> the
> > > issue we are trying to fix.
> >
> > I think the question is: how do we know what the user intended?  If
> > the user wants the directory to be accessible only to the owner, then
> > we ought to set the permissions on the directory itself and of
> > everything inside it to 0700 (or 0600).  If they want group access, we
> > should set everything to 0750 (or 0644).  But how do we know what the
> > user wants?
> >
> > Right now, we take the position that the user wants the individual
> > files to have the same mode that they do on the master, but the
> > directory should retain its existing permissions.  That appears to be
> > pretty silly, because that might end up creating a bunch of files
> > inside the directory that are marked as group-readable while the
> > directory itself isn't; surely nobody wants that.  Adopting this patch
> > would fix that inconsistency.
> >
> > However, it might be better to go the other way.  Maybe pg_basebackup
> > should decide whether group permission is appropriate for the
> > contained files and directories not by looking at the master, but by
> > looking at the directory into which it's writing.  The basic objection
> > to this patch seems to be that we should not assume that the user got
> > the permissions on the existing directory wrong, and I think that
> > objection is fair, but if we accept it, then we should ask why we're
> > setting the permission of everything under that directory according to
> > some other methodology.
>
> Going based on the current setting of the directory seems defensible to
> me, with the argument of "we trust you created the directory the way you
> want the rest of the system to be".
>

Which I believe is also how a plain unix cp (or tar or whatever) would
work, isn't it? I think that alone is a pretty strong reason to work the
same as those -- they're not entirely dissimilar.


> Another option would be to provide a pg_basebackup option to allow the
> > user to specify what they intended i.e.  --[no-]group-read.  (Tying it
> > to -R doesn't sound like a good decision to me.)
>
> I definitely think that we should add an option to allow the user to
> tell us explicitly what they want here, even if we also go based on what
> the created directory has (and in that case, we should make everything,
> including the base directory, follow what the user asked for).
>

+1 for having an option to override whatever the default is.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: Online verification of checksums

2019-03-18 Thread Michael Paquier
On Mon, Mar 18, 2019 at 02:38:10AM -0400, Stephen Frost wrote:
> Uh, we are, of course, going to have partial reads- we just need to
> handle them appropriately, and that's not hard to do in a way that we
> never have false positives.

Ere, my apologies here.  I meant the read of a torn page, not a
partial read (when extending the relation file we have locks
preventing from a partial read as well by the way).
--
Michael


signature.asc
Description: PGP signature


Re: Online verification of checksums

2019-03-18 Thread Stephen Frost
Greetings,

* Michael Banck (michael.ba...@credativ.de) wrote:
> On Monday, 18.03.2019 at 02:38 -0400, Stephen Frost wrote:
> > * Michael Paquier (mich...@paquier.xyz) wrote:
> > > On Mon, Mar 18, 2019 at 01:43:08AM -0400, Stephen Frost wrote:
> > > > To be clear, I agree completely that we don't want to be reporting false
> > > > positives or "this might mean corruption!" to users running the tool,
> > > > but I haven't seen a good explanation of why this needs to involve the
> > > > server to avoid that happening.  If someone would like to point that out
> > > > to me, I'd be happy to go read about it and try to understand.
> > > 
> > > The mentions on this thread that the server has all the facility in
> > > place to properly lock a buffer and make sure that a partial read
> > > *never* happens and that we *never* have any kind of false positives,
> > 
> > Uh, we are, of course, going to have partial reads- we just need to
> > handle them appropriately, and that's not hard to do in a way that we
> > never have false positives.
> 
> I think the current patch (V13 from https://www.postgresql.org/message-i
> d/1552045881.4947.43.ca...@credativ.de) does that, modulo possible bugs.

I think the question here is- do you ever see false positives with this
latest version..?  If you are, then that's an issue and we should
discuss and try to figure out what's happening.  If you aren't seeing
false positives, then it seems like we're done here, right?

> > I do not understand, at all, the whole sub-thread argument that we have
> > to avoid partial reads.  We certainly don't worry about that when doing
> > backups, and I don't see why we need to avoid it here.  We are going to
> > have partial reads- and that's ok, as long as it's because we're at the
> > end of the file, and that's easy enough to check by just doing another
> > read to see if we get back zero bytes, which indicates we're at the end
> > of the file, and then we move on, no need to coordinate anything with
> > the backend for this.
> 
> Well, I agree with you, but we don't seem to have consensus on that.

I feel like everyone is concerned that we'd report an acceptable partial
read as a failure, which would indeed be a false positive, and I agree
entirely that we don't want false positives.  But the answer to that seems
to be that we shouldn't report partial reads as failures, which solves the
problem in a simple way that doesn't involve the server and doesn't
materially reduce the check that's being performed.

> > > directly preventing the set of issues we are trying to implement
> > > workarounds for in a frontend tool are rather good arguments in my
> > > opinion (you can grep for BufferDescriptorGetIOLock() on this thread 
> > > for example).
> > 
> > Sure the backend has those facilities since it needs to, but these
> > frontend tools *don't* need that to *never* have any false positives, so
> > why are we complicating things by saying that this frontend tool and the
> > backend have to coordinate?
> > 
> > If there's an explanation of why we can't avoid having false positives
> > in the frontend tool, I've yet to see it.  I definitely understand that
> > we can get partial reads, but a partial read isn't a failure, and
> > shouldn't be reported as such.
> 
> It is not in the current patch; it should just get reported as a skipped
> block at the end -- if the cluster is online, that is.  If it is offline,
> we do consider it a failure.

Ok, that sounds fine- and do we ever see false positives now?

> I have now rebased that patch on top of the pg_verify_checksums ->
> pg_checksums renaming, see attached.

Thanks for that.  Reading through the code though, I don't entirely
understand why we're making things complicated for ourselves by trying
to seek and re-read the entire block, specifically this:

> 		if (r != BLCKSZ)
> 		{
> -			fprintf(stderr, _("%s: could not read block %u in file \"%s\": read %d of %d\n"),
> -					progname, blockno, fn, r, BLCKSZ);
> -			exit(1);
> +			if (online)
> +			{
> +				if (block_retry)
> +				{
> +					/* We already tried once to reread the block, skip to the next block */
> +					skippedblocks++;
> +					if (lseek(f, BLCKSZ-r, SEEK_CUR) == -1)
> +					{
> +						skippedfiles++;
> +						fprintf(stderr, _("%s: could not lseek to next block in file \"%s\": %m\n"),
> +								progname, fn);
> +						return;
> +					}
> +					continue;
> +				}
> +
> +

Re: Online verification of checksums

2019-03-18 Thread Stephen Frost
Greetings,

* Michael Paquier (mich...@paquier.xyz) wrote:
> On Mon, Mar 18, 2019 at 02:38:10AM -0400, Stephen Frost wrote:
> > Uh, we are, of course, going to have partial reads- we just need to
> > handle them appropriately, and that's not hard to do in a way that we
> > never have false positives.
> 
> Ere, my apologies here.  I meant the read of a torn page, not a

In the case of a torn page, we should be able to check the LSN, as
discussed extensively previously, and if the LSN is from after the
checkpoint we started at then we should be fine to skip the page.
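
For illustration, a minimal sketch of that LSN test (checkpointLSN here is
assumed to come from the control file, as in the patch upthread, and the
function name is made up):

#include "postgres.h"
#include "access/xlogdefs.h"
#include "storage/bufpage.h"

/*
 * Sketch only: decide whether a page's checksum is worth verifying.  A page
 * whose LSN is newer than the checkpoint we started from was (re)written
 * during the scan and can simply be skipped.
 */
static bool
page_verifiable_sketch(Page page, XLogRecPtr checkpointLSN)
{
	/* New (all-zero) pages carry no checksum yet. */
	if (PageIsNew(page))
		return false;

	/* Torn or concurrently rewritten page: skip, don't report. */
	if (PageGetLSN(page) > checkpointLSN)
		return false;

	/* Otherwise compare against pg_checksum_page() as usual. */
	return true;
}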

> partial read (when extending the relation file we have locks
> preventing from a partial read as well by the way).

Yes, we do, in the backend...  We don't have to (nor do we need to) get
involved in those locks for these tools, though.

Thanks!

Stephen


signature.asc
Description: PGP signature


Re: Online verification of checksums

2019-03-18 Thread Michael Banck
Hi,

On Monday, 18.03.2019 at 08:18 +0100, Michael Banck wrote:
> I have now rebased that patch on top of the pg_verify_checksums ->
> pg_checksums renaming, see attached.

Sorry, I had missed some hunks in the TAP tests, fixed-up patch
attached.


Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.ba...@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz

diff --git a/doc/src/sgml/ref/pg_checksums.sgml b/doc/src/sgml/ref/pg_checksums.sgml
index 6a47dda683..124475f057 100644
--- a/doc/src/sgml/ref/pg_checksums.sgml
+++ b/doc/src/sgml/ref/pg_checksums.sgml
@@ -37,9 +37,8 @@ PostgreSQL documentation
   Description
   
pg_checksums verifies data checksums in a
-   PostgreSQL cluster.  The server must be shut
-   down cleanly before running pg_checksums.
-   The exit status is zero if there are no checksum errors, otherwise nonzero.
+   PostgreSQL cluster.  The exit status is zero if
+   there are no checksum errors, otherwise nonzero.  
   
  
 
diff --git a/src/bin/pg_checksums/pg_checksums.c b/src/bin/pg_checksums/pg_checksums.c
index b7ebc11017..0ed065f7e9 100644
--- a/src/bin/pg_checksums/pg_checksums.c
+++ b/src/bin/pg_checksums/pg_checksums.c
@@ -1,7 +1,7 @@
 /*-
  *
  * pg_checksums.c
- *	  Verifies page level checksums in an offline cluster.
+ *	  Verifies page level checksums in a cluster.
  *
  * Copyright (c) 2010-2019, PostgreSQL Global Development Group
  *
@@ -28,12 +28,16 @@
 
 
 static int64 files = 0;
+static int64 skippedfiles = 0;
 static int64 blocks = 0;
 static int64 badblocks = 0;
+static int64 skippedblocks = 0;
 static ControlFileData *ControlFile;
+static XLogRecPtr checkpointLSN;
 
 static char *only_relfilenode = NULL;
 static bool verbose = false;
+static bool online = false;
 
 static const char *progname;
 
@@ -90,10 +94,17 @@ scan_file(const char *fn, BlockNumber segmentno)
 	PageHeader	header = (PageHeader) buf.data;
 	int			f;
 	BlockNumber blockno;
+	bool		block_retry = false;
 
 	f = open(fn, O_RDONLY | PG_BINARY, 0);
 	if (f < 0)
 	{
+		if (online && errno == ENOENT)
+		{
+			/* File was removed in the meantime */
+			return;
+		}
+
 		fprintf(stderr, _("%s: could not open file \"%s\": %s\n"),
 progname, fn, strerror(errno));
 		exit(1);
@@ -108,26 +119,130 @@ scan_file(const char *fn, BlockNumber segmentno)
 
 		if (r == 0)
 			break;
+		if (r < 0)
+		{
+			skippedfiles++;
+			fprintf(stderr, _("%s: could not read block %u in file \"%s\": %s\n"),
+	progname, blockno, fn, strerror(errno));
+			return;
+		}
 		if (r != BLCKSZ)
 		{
-			fprintf(stderr, _("%s: could not read block %u in file \"%s\": read %d of %d\n"),
-	progname, blockno, fn, r, BLCKSZ);
-			exit(1);
+			if (online)
+			{
+if (block_retry)
+{
+	/* We already tried once to reread the block, skip to the next block */
+	skippedblocks++;
+	if (lseek(f, BLCKSZ-r, SEEK_CUR) == -1)
+	{
+		skippedfiles++;
+		fprintf(stderr, _("%s: could not lseek to next block in file \"%s\": %m\n"),
+progname, fn);
+		return;
+	}
+	continue;
+}
+
+/*
+ * Retry the block. It's possible that we read the block while it
+ * was extended or shrunk, so it ends up looking torn to us.
+ */
+
+/*
+ * Seek back by the amount of bytes we read to the beginning of
+ * the failed block.
+ */
+if (lseek(f, -r, SEEK_CUR) == -1)
+{
+	skippedfiles++;
+	fprintf(stderr, _("%s: could not lseek in file \"%s\": %m\n"),
+			progname, fn);
+	return;
+}
+
+/* Set flag so we know a retry was attempted */
+block_retry = true;
+
+/* Reset loop to validate the block again */
+blockno--;
+
+continue;
+			}
+			else
+			{
+skippedfiles++;
+fprintf(stderr, _("%s: could not read block %u in file \"%s\": read %d of %d\n"),
+		progname, blockno, fn, r, BLCKSZ);
+return;
+			}
 		}
-		blocks++;
 
 		/* New pages have no checksum yet */
 		if (PageIsNew(header))
+		{
+			skippedblocks++;
 			continue;
+		}
+
+		blocks++;
 
 		csum = pg_checksum_page(buf.data, blockno + segmentno * RELSEG_SIZE);
 		if (csum != header->pd_checksum)
 		{
+			if (online)
+			{
+/*
+ * Retry the block on the first failure if online.  If the
+ * verification is done while the instance is online, it is
+ * possible that we read the first 4K page of the block just
+ * before postgres updated the entire block so it ends up
+ * looking torn to us.  We only need to retry once because the
+ * LSN should be updated to something we can ignore on the next
+ * pass.  If the error happens again then it is a true
+ *

Re: Data-only pg_rewind, take 2

2019-03-18 Thread Chris Travers
On Mon, Mar 18, 2019 at 4:09 AM Michael Paquier  wrote:

> On Sun, Mar 17, 2019 at 09:00:57PM +0800, Chris Travers wrote:
> > I also added test cases and some docs.  I don't know if the docs are
> > sufficient.  Feedback is appreciated.
>
> To be honest, I don't think that this approach is a good idea per the
> same reasons as mentioned the last time, as this can cause pg_rewind
> to break if any newly-added folder in the data directory has
> non-replaceable data which is needed at the beginning of recovery and
> cannot be automatically rebuilt.  So that's one extra maintenance
> burden to worry about.
>

Actually I think this is safe.  Let me go through the cases not handled in
the current behavior at all:

1.  For the RPMs we distribute, we clobber db logs, which means we overwrite
application logs on the failed system with copies of logs from a replica.
This means that after you rewind, you lose the ability to figure out what
went wrong.  This is an exquisitely bad idea unless you like data loss, and
since this location is configurable you can't just say "well, we put our
logs here so we are excluding them."  Making this configured per rewind run
strikes me as error-prone and something that may lead to hidden
interference with postmortems in the future, and postmortems are vitally
important in terms of running database clusters with any sort of
reliability guarantees.

2.  With postgresql.auto.conf now holding the recovery.conf info, you have
some very significant failure cases with regard to replication if you
accidentally clobber it.

On to the corner cases with --data-only enabled and the implications as I
see them since this preserves files on the old master but does not copy
them from the replica:

1.  If the changes are not WAL-logged (let's say CSVs created using a file
foreign data wrapper), then deleting the files on rewind is where you can
lose data, and --data-only avoids this; so here you *avoid* data loss where
you put state files on the systems and do not rewind them, because they were
not WAL-logged.  However, the files still exist on the old master and are
not deleted, so the data can easily be restored at that point.  Now, we can
say, probably, that putting data files in $PGDATA that are not WAL-logged
is a bad idea.  But even if you put them somewhere else, pg_rewind isn't
going to magically move them over to the replica for you.

2.  If the changes *are* WAL-logged, then you have a problem with
--data-only which is not present without it, namely that files can get out
of sync with their WAL-logged updates.  So in this case, --data-only is
*not* safe.

So here I think we have to make a choice.  For now I don't feel
comfortable changing the default behavior, but the default behavior could
cause data loss in certain cases (including the ones I think you are
concerned about).  Maybe it would be better if I documented the above points?


>
> Here is the reference of the last thread about the same topic:
>
> https://www.postgresql.org/message-id/can-rpxd8y7hmojzd93hoqv6n8kpeo5cmw9gym+8jirtpifn...@mail.gmail.com
> --
> Michael
>


-- 
Best Regards,
Chris Travers
Head of Database

Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com
Saarbrücker Straße 37a, 10405 Berlin


Re: Online verification of checksums

2019-03-18 Thread Michael Banck
Hi.

On Monday, 18.03.2019 at 03:34 -0400, Stephen Frost wrote:
> * Michael Banck (michael.ba...@credativ.de) wrote:
> > On Monday, 18.03.2019 at 02:38 -0400, Stephen Frost wrote:
> > > * Michael Paquier (mich...@paquier.xyz) wrote:
> > > > On Mon, Mar 18, 2019 at 01:43:08AM -0400, Stephen Frost wrote:
> > > > > To be clear, I agree completely that we don't want to be reporting 
> > > > > false
> > > > > positives or "this might mean corruption!" to users running the tool,
> > > > > but I haven't seen a good explanation of why this needs to involve 
> > > > > the
> > > > > server to avoid that happening.  If someone would like to point that 
> > > > > out
> > > > > to me, I'd be happy to go read about it and try to understand.
> > > > 
> > > > The mentions on this thread that the server has all the facility in
> > > > place to properly lock a buffer and make sure that a partial read
> > > > *never* happens and that we *never* have any kind of false positives,
> > > 
> > > Uh, we are, of course, going to have partial reads- we just need to
> > > handle them appropriately, and that's not hard to do in a way that we
> > > never have false positives.
> > 
> > I think the current patch (V13 from https://www.postgresql.org/message-i
> > d/1552045881.4947.43.ca...@credativ.de) does that, modulo possible bugs.
> 
> I think the question here is- do you ever see false positives with this
> latest version..?  If you are, then that's an issue and we should
> discuss and try to figure out what's happening.  If you aren't seeing
> false positives, then it seems like we're done here, right?

What do you mean by false positives here? I've never seen a bogus
checksum failure, i.e. pg_checksums claiming some checksum is wrong
because it only read half of a block or a torn page.

I do see sporadic partial reads, and they get handled by the re-check
logic and (if that is not enough) get tallied up as skipped blocks in
the end.  Is that a false positive in your book?

[...]

> > I have now rebased that patch on top of the pg_verify_checksums ->
> > pg_checksums renaming, see attached.
> 
> Thanks for that.  Reading through the code though, I don't entirely
> understand why we're making things complicated for ourselves by trying
> to seek and re-read the entire block, specifically this:

[...]

> I would think that we could just do:
> 
>   insert_location = 0;
>   r = read(BLCKSIZE - insert_location);
>   if (r < 0) error();
>   if (r == 0) EOF detected, move to next
>   if (r < (BLCKSIZE - insert_location)) {
> insert_location += r;
> continue;
>   }
> 
> At this point, we should have a full block, do our checks...

Well, we need to read() into some buffer, which you have omitted.

So if we had a short read, and then read the rest of the block via
read(BLCKSIZE - insert_location), wouldn't we have to read that into a second
buffer and then join the two in order to compute the checksum?  That
does not sound simpler to me than just re-reading the block entirely.

> Have you seen cases where the kernel will actually return a partial read
> for something that isn't at the end of the file, and where you could
> actually lseek past that point and read the next block?  I'd be really
> curious to see that if you can reproduce it...  I've definitely seen
> empty pages come back with a claim that the full amount was read, but
> that's a very different thing.

Well, I've seen partial reads, and I have very rarely seen it continue to
read another block afterwards.  If the relation is being extended while we
check it, it sounds plausible that another block could be written before we
get to read EOF on the next read() after a partial read(), so that does not
sound like a bug to me either.

I might be misunderstanding your question though?


Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.ba...@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz



Re: [PATCH] get rid of StdRdOptions, use individual binary reloptions representation for each relation kind instead

2019-03-18 Thread Kyotaro HORIGUCHI
Hello.

At Mon, 18 Mar 2019 03:03:04 +, "Iwata, Aya"  
wrote in <71E660EB361DF14299875B198D4CE5423DF05777@g01jpexmbkw25>
> This patch does not apply.  Please refer to http://commitfest.cputube.org/ 
> and update it.
> How about separating your patch by feature, or into units that can be
> committed in phases?  For example, adding the assert macros first,
> refactoring StdRdOptions next, etc.
> 
> So I change status to "Waiting for Author".

That seems to be a good opportunity. I have some comments.

rel.h:
-#define RelationGetToastTupleTarget(relation, defaulttarg) \
-((relation)->rd_options ? \
- ((StdRdOptions *) (relation)->rd_options)->toast_tuple_target : 
(defaulttarg))
+#define RelationGetToastTupleTarget(relation, defaulttarg) \
+(AssertMacro(relation->rd_rel->relkind == RELKIND_RELATION ||\
+ relation->rd_rel->relkind == RELKIND_MATVIEW),\
+ ((relation)->rd_options ? \
+  ((HeapRelOptions *) (relation)->rd_options)->toast_tuple_target : \
+(defaulttarg)))

Index AMs parse reloptions via their handler
functions. HeapRelOptions in this patch are parsed in
relation_reloptions, which calls heap_reloptions. Maybe at least it
should be called table_options. But I'm not sure what the
desirable shape of relation_reloptions is for the moment.

hio.c:

-saveFreeSpace = RelationGetTargetPageFreeSpace(relation,
-   HEAP_DEFAULT_FILLFACTOR);
+if (IsToastRelation(relation))
+saveFreeSpace = ToastGetTargetPageFreeSpace();
+else
+saveFreeSpace = HeapGetTargetPageFreeSpace(relation);

This locution appears four times in the patch, and those are the only
places where the two macros appear. It might not be good that
RelationGetBufferForTuple identifies a heap directly, since I
suppose that currently toast is a part of heap. Thus it'd be
better for HeapGetTargetPageFreeSpace to handle the toast case.
Similarly, it doesn't look good that RelationGetBufferForTuple
consults HeapGetTargetPageFreeSpace() directly. Perhaps it
should be called via TableGetTargetPageFreeSpace(). I'm not sure
what the right shape of the relationship among a relation and
a table and other kinds of relation is. extract_autovac_opts
penetrates the modularity boundary of toast/heap in a
similar way.
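
Something like the following sketch, reusing the patch's macro names
(TableGetTargetPageFreeSpace is only the name suggested above, not existing
code):

/*
 * Sketch: fold the toast/heap branch that currently appears at the four
 * call sites into one wrapper, so callers such as RelationGetBufferForTuple
 * stay relkind-agnostic.
 */
#define TableGetTargetPageFreeSpace(relation) \
	(IsToastRelation(relation) ? \
	 ToastGetTargetPageFreeSpace() : \
	 HeapGetTargetPageFreeSpace(relation))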


plancat.c:
+if (relation->rd_rel->relkind == RELKIND_RELATION ||
+relation->rd_rel->relkind == RELKIND_MATVIEW)
+rel->rel_parallel_workers = RelationGetParallelWorkers(relation, -1);
+else
+rel->rel_parallel_workers = -1;

rel.h:
#define RelationGetParallelWorkers(relation, defaultpw) \
(AssertMacro(relation->rd_rel->relkind == RELKIND_RELATION ||\
 relation->rd_rel->relkind == RELKIND_MATVIEW),\
 ((relation)->rd_options ? \
  ((HeapRelOptions *) (relation)->rd_options)->parallel_workers : \
(defaultpw)))

These functions/macros do the same check twice per call.


regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Online verification of checksums

2019-03-18 Thread Stephen Frost
Greetings,

On Mon, Mar 18, 2019 at 15:52 Michael Banck 
wrote:

> Hi.
>
> On Monday, 18.03.2019 at 03:34 -0400, Stephen Frost wrote:
> > * Michael Banck (michael.ba...@credativ.de) wrote:
> > > On Monday, 18.03.2019 at 02:38 -0400, Stephen Frost wrote:
> > > > * Michael Paquier (mich...@paquier.xyz) wrote:
> > > > > On Mon, Mar 18, 2019 at 01:43:08AM -0400, Stephen Frost wrote:
> > > > > > To be clear, I agree completely that we don't want to be
> reporting false
> > > > > > positives or "this might mean corruption!" to users running the
> tool,
> > > > > > but I haven't seen a good explanation of why this needs to
> involve the
> > > > > > server to avoid that happening.  If someone would like to point
> that out
> > > > > > to me, I'd be happy to go read about it and try to understand.
> > > > >
> > > > > The mentions on this thread that the server has all the facility in
> > > > > place to properly lock a buffer and make sure that a partial read
> > > > > *never* happens and that we *never* have any kind of false
> positives,
> > > >
> > > > Uh, we are, of course, going to have partial reads- we just need to
> > > > handle them appropriately, and that's not hard to do in a way that we
> > > > never have false positives.
> > >
> > > I think the current patch (V13 from
> https://www.postgresql.org/message-i
> > > d/1552045881.4947.43.ca...@credativ.de) does that, modulo possible
> bugs.
> >
> > I think the question here is- do you ever see false positives with this
> > latest version..?  If you are, then that's an issue and we should
> > discuss and try to figure out what's happening.  If you aren't seeing
> > false positives, then it seems like we're done here, right?
>
> What do you mean with false positives here? I've never seen a bogus
> checksum failure, i.e. pg_checksums claiming some checksum is wrong
> cause it only read half of a block or a torn page.
>
> I do see sporadic partial reads and they get treated by the re-check
> logic and (if that is not enough) get tallied up as a skipped block in
> the end.  Is that a false positive in your book?


No, that’s clearly not a false positive.

[...]
>
> > > I have now rebased that patch on top of the pg_verify_checksums ->
> > > pg_checksums renaming, see attached.
> >
> > Thanks for that.  Reading through the code though, I don't entirely
> > understand why we're making things complicated for ourselves by trying
> > to seek and re-read the entire block, specifically this:
>
> [...]
>
> > I would think that we could just do:
> >
> >   insert_location = 0;
> >   r = read(BLCKSIZE - insert_location);
> >   if (r < 0) error();
> >   if (r == 0) EOF detected, move to next
> >   if (r < (BLCKSIZE - insert_location)) {
> > insert_location += r;
> > continue;
> >   }
> >
> > At this point, we should have a full block, do our checks...
>
> Well, we need to read() into some buffer which you have ommitted.


Surely there’s a buffer that the read in the existing code is passing in; you
just need to offset by the current pointer. Sorry for not being clear.

In other words the read would look more like:

read(fd,buf + insert_ptr, BUFSZ - insert_ptr)

And then you have to reset insert_ptr once you have a full block.
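
A minimal sketch of that loop (BLCKSZ written out literally here, and
verify_block standing in for the actual checksum check, both just for
illustration):

#include <stdbool.h>
#include <unistd.h>

#define BLCKSZ 8192						/* for illustration */

extern void verify_block(const char *buf);	/* assumed checksum check */

/* Sketch only: accumulate one full block across partial reads. */
static bool
scan_block_sketch(int fd, char *buf)
{
	ssize_t		insert_ptr = 0;

	for (;;)
	{
		ssize_t		r = read(fd, buf + insert_ptr, BLCKSZ - insert_ptr);

		if (r < 0)
			return false;		/* real I/O error: report it */
		if (r == 0)
			return false;		/* EOF: a clean block boundary, or a
								 * trailing partial block to skip */
		insert_ptr += r;
		if (insert_ptr < BLCKSZ)
			continue;			/* short read: keep filling */

		verify_block(buf);		/* full block assembled */
		return true;
	}
}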

> So if we had a short read, and then read the rest of the block via
> (BLCKSIZE - insert_location) wouldn't we have to read that in a second
> buffer and then join the two in order to compute the checksum?  That
> does not sounds simpler to me than just re-reading the block entirely.


No, just read into your existing buffer at the point where the prior
partial read left off...

> > Have you seen cases where the kernel will actually return a partial read
> > for something that isn't at the end of the file, and where you could
> > actually lseek past that point and read the next block?  I'd be really
> > curious to see that if you can reproduce it...  I've definitely seen
> > empty pages come back with a claim that the full amount was read, but
> > that's a very different thing.
>
> Well, I've seen partial reads and I have seen very rarely that it will
> continue to read another block afterwards.  If the relation is being
> extended while we check it, it sounds plausible that another block could
> be written before we get to read EOF on the next read() after a partial
> read() so that does not sounds like a bug to me either.


Right, absolutely you can have a partial read during a relation extension
and then come back around and do another read and discover more data,
that’s entirely reasonable and I’ve seen it happen too.

> I might be misunderstanding your question though?


Yes, the question was more like this: have you ever seen a read return a
partial result when you know you’re in the middle somewhere of an existing
file and the length of the file hasn’t been changed by something else..?  I
can’t say that I have, when reading from regular files, even in
kernel-error type of conditions due to hardware issues, but I’m open to
being told I’m wro

Re: Offline enabling/disabling of data checksums

2019-03-18 Thread Michael Paquier
On Fri, Mar 15, 2019 at 01:37:27PM +0100, Michael Banck wrote:
> On Friday, 15.03.2019 at 21:23 +0900, Michael Paquier wrote:
>> Perhaps having them under --verbose makes more sense?
> 
> Well if we think it is essential in order to tell the user what happened
> in the case of an error, it shouldn't be verbose I guess.

I would still keep them to be honest.  I don't know, if others find
the tool too chatty we could always rework that part and tune it.

Please find attached an updated patch set, I have rebased that stuff
on top of my recent commits to refactor the control file updates.
While reviewing, I have found a problem in the docs (forgot a 
markup previously), and there was a problem with the parent path fsync
causing an interruption to not return the correct error code, and
actually we should just use durable_rename() in this case (if
--no-sync gets in then pg_mv_file() should be used of course).

I have also been thinking about what we could add in the
documentation, so this version adds a draft to describe the cases
where enabling checksums can lead to corruption when involving
multiple nodes in a cluster and tools doing physical copy of relation
blocks.

I have not done the --no-sync part yet on purpose, as that will most
likely conflict based on the feedback received for this version..
--
Michael
From a85112d87ec4bc4b00d22c105b9958a2c70c3758 Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Mon, 18 Mar 2019 17:12:15 +0900
Subject: [PATCH] Add options to enable and disable checksums in pg_checksums

An offline cluster can now work with more modes in pg_checksums:
- --enable can enable checksums in a cluster, updating all blocks with a
correct checksum, and update the control file at the end.
- --disable can disable checksums in a cluster, updating the control
file.
- --check is an extra option able to verify checksums for a cluster.

When using --disable, only the control file is updated and then
flushed.  When using --enable, the process gets more complicated as the
operation can be long:
- Rename the control file to a temporary name, to prevent a parallel
startup of Postgres.
- Scan all files and update their checksums.
- Rename back the control file.
- Flush the data directory.
- Update the control file and then flush it, to make the change
durable.
If the operation is interrupted, the control file gets moved back in
place.

If no mode is specified in the options, then --check is used for
compatibility with older versions of pg_verify_checksums (now renamed to
pg_checksums in v12).
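
As a rough plain-C illustration of the --enable sequence above (the helper
names update_all_block_checksums, sync_data_directory and
stamp_control_file_checksums_on are assumptions, not functions from this
patch):

#include <stdbool.h>
#include <stdio.h>

extern bool update_all_block_checksums(const char *datadir);	/* assumed */
extern void sync_data_directory(const char *datadir);			/* assumed */
extern bool stamp_control_file_checksums_on(const char *ctrl);	/* assumed */

/* Sketch only: the rename-aside / rewrite / rename-back / flush dance. */
static bool
enable_checksums_sketch(const char *datadir)
{
	char		ctrl[4096], ctrl_tmp[4096];

	snprintf(ctrl, sizeof(ctrl), "%s/global/pg_control", datadir);
	snprintf(ctrl_tmp, sizeof(ctrl_tmp),
			 "%s/global/pg_control.pg_checksums_in_progress", datadir);

	/* 1. Park the control file so a parallel startup of Postgres fails. */
	if (rename(ctrl, ctrl_tmp) != 0)
		return false;

	/* 2. Rewrite every block with a correct checksum. */
	if (!update_all_block_checksums(datadir))
	{
		rename(ctrl_tmp, ctrl);		/* interrupted: move it back */
		return false;
	}

	/* 3. Move the control file back and flush the data directory. */
	if (rename(ctrl_tmp, ctrl) != 0)
		return false;
	sync_data_directory(datadir);

	/* 4. Mark checksums enabled in the control file and flush it. */
	return stamp_control_file_checksums_on(ctrl);
}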

Author: Michael Banck, Michael Paquier
Reviewed-by: Fabien Coelho, Magnus Hagander, Sergei Kornilov
Discussion: https://postgr.es/m/20181221201616.gd4...@nighthawk.caipicrew.dd-dns.de
---
 doc/src/sgml/ref/pg_checksums.sgml|  72 ++-
 src/bin/pg_checksums/pg_checksums.c   | 280 +++---
 src/bin/pg_checksums/t/002_actions.pl |  76 +--
 src/tools/pgindent/typedefs.list  |   1 +
 4 files changed, 381 insertions(+), 48 deletions(-)

diff --git a/doc/src/sgml/ref/pg_checksums.sgml b/doc/src/sgml/ref/pg_checksums.sgml
index 6a47dda683..a7f4ef1024 100644
--- a/doc/src/sgml/ref/pg_checksums.sgml
+++ b/doc/src/sgml/ref/pg_checksums.sgml
@@ -16,7 +16,7 @@ PostgreSQL documentation
 
  
   pg_checksums
-  verify data checksums in a PostgreSQL database cluster
+  enable, disable or check data checksums in a PostgreSQL database cluster
  
 
  
@@ -36,10 +36,24 @@ PostgreSQL documentation
  
   Description
   
-   pg_checksums verifies data checksums in a
-   PostgreSQL cluster.  The server must be shut
-   down cleanly before running pg_checksums.
-   The exit status is zero if there are no checksum errors, otherwise nonzero.
+   pg_checksums checks, enables or disables data
+   checksums in a PostgreSQL cluster.  The server
+   must be shut down cleanly before running
+   pg_checksums. The exit status is zero if there
+   are no checksum errors when checking them, and nonzero if at least one
+   checksum failure is detected. If enabling or disabling checksums, the
+   exit status is nonzero if the operation failed.
+  
+
+  
+   Checking checksums requires scanning every file holding them in the data
+   folder.  Disabling checksums requires only an update of the file
+   pg_control.  Enabling checksums first renames
+   the file pg_control to
+   pg_control.pg_checksums_in_progress to prevent
+   a parallel startup of the cluster, then it updates all files with
+   checksums, and it finishes by renaming and updating
+   pg_control to mark checksums as enabled.
   
  
 
@@ -60,6 +74,37 @@ PostgreSQL documentation
   
  
 
+ 
+  -c
+  --check
+  
+   
+Checks checksums. This is the default mode if nothing else is
+specified.
+   
+  
+ 
+
+ 
+  -d
+  --disable
+  
+   
+Disables checksums.
+   
+  
+ 
+
+ 
+  -e
+  --enable
+  
+   
+Enables checksums.
+   
+

Re: pg_basebackup ignores the existing data directory permissions

2019-03-18 Thread Michael Paquier
On Mon, Mar 18, 2019 at 08:32:44AM +0100, Magnus Hagander wrote:
> On Mon, Mar 18, 2019 at 7:08 AM Stephen Frost  wrote:
>> I definitely think that we should add an option to allow the user to
>> tell us explicitly what they want here, even if we also go based on what
>> the created directory has (and in that case, we should make everything,
>> including the base directory, follow what the user asked for).
> 
> +1 for having an option to override whatever the default is.

Based on the feedback gathered, having a separate option to enforce
the default and not touching the behavior implemented until now,
sounds fine to me.
--
Michael


signature.asc
Description: PGP signature


Re: partitioned tables referenced by FKs

2019-03-18 Thread Amit Langote
Hi,

On 2019/03/15 2:31, Alvaro Herrera wrote:
> Once I was finished, fixed bugs and tested it, I realized that that was
> a stupid thing to have done -- because THOSE ARE DIFFERENT CONSTRAINTS.
> When you say "fk (a) references pk1" you're saying that all the values
> in fk(a) must appear in pk1.  OTOH when you say "fk references pk" you
> mean that the values might appear anywhere in pk, not just pk1.

Sure, then just drop the check trigger that queries only pk1, in favor of
the one that checks pk, which is already in place due to the parent constraint.
Actually, if one doesn't drop it (that is, by way of dropping the
constraint that created it), they won't be able to insert into fk anything
but the subset of rows that pk1 allows as a partition; see this:

create table pk1 (a int primary key);
insert into pk1 values (1);
create table pk (a int primary key) partition by list (a);
create table fk (a int, foreign key (a) references pk1, foreign key (a)
references pk);

insert into fk values (1);
ERROR:  insert or update on table "fk" violates foreign key constraint
"fk_a_fkey1"
DETAIL:  Key (a)=(1) is not present in table "pk".

alter table pk attach partition pk1 for values in (1);
insert into fk values (1);

create table pk2 partition of pk for values in (2);
insert into pk values (2);

insert into fk values (2);
ERROR:  insert or update on table "fk" violates foreign key constraint
"fk_a_fkey"
DETAIL:  Key (a)=(2) is not present in table "pk1".

You can no longer insert (2) into pk1 though.

Now it's true that a user can drop the constraint directly referencing pk1
themselves to make the above error go away, but it would've been nice if
they didn't have to do it.  That would be the case if the constraint were
internally converted into a child constraint when attaching pk1 to pk, as
I'm suggesting.

> Have a
> look at the triggers in pg_trigger that appear when you do "references
> pk1" vs. when you do "references pk".  The constraint itself might
> appear identical, but the former adds check triggers that are not
> present for the latter.

Let's consider both triggers.

1. The action trigger on the referenced table does the same thing no
matter whether it's a partition or not, which is this:

select 1 from fk x where $1 = a for key share of x;

Note that the trigger is invoked both when the referenced table is
directly mentioned in the query (delete from pk1) and when it's indirectly
mentioned via its parent table (delete from pk).  Duplication of triggers
in the latter case causes the same set of rows to be added twice for
checking in the respective triggers' contexts, which seems pointless.

2. The check trigger on the referencing table checks exactly the primary
key table that's mentioned in the foreign key constraint.  So, when the
foreign key constraint is "referencing pk1", the check query is:

select 1 from pk1 x where a = $1 for key share of x;

whereas when the foreign key constraint is "referencing pk", it's:

select 1 from pk x where a = $1 for key share of x;

The latter scans appropriate partition of pk, which may be pk1, so we can
say the first check trigger is redundant, and in fact, even annoying as
shown above.


So, when attaching pk1 to pk (where both are referenced by fk using
exactly the same columns and same other constraint attributes), we *can*
reuse the foreign key on fk referencing pk1, as the child constraint of
foreign key on fk referencing pk, whereby, we:

1. don't need to create a new action trigger on pk1, because the existing
one is enough,

2. can drop the check trigger on fk that checks pk1, because the one that
checks pk will be enough

As you know, we've done similar stuff in 123cc697a8eb + cb90de1aac18 for
the case where the partition is on the referencing side.  In that case, we
can reuse the check trigger as it checks the same table in all partitions
and drop the redundant action trigger.

Thanks,
Amit




Re: Parallel query vs smart shutdown and Postmaster death

2019-03-18 Thread Arseny Sher

Thomas Munro  writes:

> Just a thought: instead of the new hand-coded loop you added in
> pmdie(), do you think it would make sense to have a new argument
> "exclude_class_mask" for SignalSomeChildren()?  If we did that, I
> would consider renaming the existing parameter "target" to
> "include_type_mask" to make it super clear what's going on.

I thought that this is a bit too complicated for single use-case, but if
you like it better, here is an updated version.

--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

>From 2e6359df5bf60b9ab6ca6e35fa347993019526db Mon Sep 17 00:00:00 2001
From: Arseny Sher 
Date: Sun, 17 Mar 2019 07:42:18 +0300
Subject: [PATCH] Allow parallel workers while backends are alive in 'smart'
 shutdown.

Since the postmaster doesn't advertise that it is never going to start parallel
workers after shutdown was issued, the parallel leader gets stuck waiting for
them indefinitely.
---
 src/backend/postmaster/postmaster.c | 72 -
 1 file changed, 56 insertions(+), 16 deletions(-)

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index fe599632d3..2ad215b12d 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -412,10 +412,10 @@ static void report_fork_failure_to_client(Port *port, int errnum);
 static CAC_state canAcceptConnections(void);
 static bool RandomCancelKey(int32 *cancel_key);
 static void signal_child(pid_t pid, int signal);
-static bool SignalSomeChildren(int signal, int targets);
+static bool SignalSomeChildren(int signal, int include_type_mask, int exclude_bgw_mask);
 static void TerminateChildren(int signal);
 
-#define SignalChildren(sig)			   SignalSomeChildren(sig, BACKEND_TYPE_ALL)
+#define SignalChildren(sig)			   SignalSomeChildren(sig, BACKEND_TYPE_ALL, 0)
 
 static int	CountChildren(int target);
 static bool assign_backendlist_entry(RegisteredBgWorker *rw);
@@ -2689,10 +2689,12 @@ pmdie(SIGNAL_ARGS)
 			if (pmState == PM_RUN || pmState == PM_RECOVERY ||
 pmState == PM_HOT_STANDBY || pmState == PM_STARTUP)
 			{
-/* autovac workers are told to shut down immediately */
-/* and bgworkers too; does this need tweaking? */
-SignalSomeChildren(SIGTERM,
-   BACKEND_TYPE_AUTOVAC | BACKEND_TYPE_BGWORKER);
+/*
+ * Shut down all bgws except parallel workers: graceful
+ * termination of existing sessions needs them.
+ * Autovac workers are also told to shut down immediately.
+ */
+SignalSomeChildren(SIGTERM, BACKEND_TYPE_WORKER, BGWORKER_CLASS_PARALLEL);
 /* and the autovac launcher too */
 if (AutoVacPID != 0)
 	signal_child(AutoVacPID, SIGTERM);
@@ -2752,7 +2754,7 @@ pmdie(SIGNAL_ARGS)
 signal_child(WalReceiverPID, SIGTERM);
 			if (pmState == PM_STARTUP || pmState == PM_RECOVERY)
 			{
-SignalSomeChildren(SIGTERM, BACKEND_TYPE_BGWORKER);
+SignalSomeChildren(SIGTERM, BACKEND_TYPE_BGWORKER, 0);
 
 /*
  * Only startup, bgwriter, walreceiver, possibly bgworkers,
@@ -2773,7 +2775,7 @@ pmdie(SIGNAL_ARGS)
 /* shut down all backends and workers */
 SignalSomeChildren(SIGTERM,
    BACKEND_TYPE_NORMAL | BACKEND_TYPE_AUTOVAC |
-   BACKEND_TYPE_BGWORKER);
+   BACKEND_TYPE_BGWORKER, 0);
 /* and the autovac launcher too */
 if (AutoVacPID != 0)
 	signal_child(AutoVacPID, SIGTERM);
@@ -3939,26 +3941,52 @@ signal_child(pid_t pid, int signal)
 
 /*
  * Send a signal to the targeted children (but NOT special children;
- * dead_end children are never signaled, either).
+ * dead_end children are never signaled, either). If BACKEND_TYPE_BGWORKER
+ * is in include_type_mask, bgworkers are additionally filtered out by
+ * exclude_bgw_mask.
  */
 static bool
-SignalSomeChildren(int signal, int target)
+SignalSomeChildren(int signal, int include_type_mask, int exclude_bgw_mask)
 {
 	dlist_iter	iter;
+	slist_iter	siter;
 	bool		signaled = false;
 
+	/*
+	 * Handle bgws by walking over BackgroundWorkerList because we have
+	 * to check the bgw class.
+	 */
+	if ((include_type_mask & BACKEND_TYPE_BGWORKER) != 0)
+	{
+		slist_foreach(siter, &BackgroundWorkerList)
+		{
+			RegisteredBgWorker *rw;
+
+			rw = slist_container(RegisteredBgWorker, rw_lnode, siter.cur);
+			if (rw->rw_pid > 0 &&
+((rw->rw_worker.bgw_flags & exclude_bgw_mask) == 0))
+			{
+ereport(DEBUG4,
+		(errmsg_internal("sending signal %d to process %d",
+		 signal, (int) rw->rw_pid)));
+signal_child(rw->rw_pid, signal);
+signaled = true;
+			}
+		}
+	}
+
 	dlist_foreach(iter, &BackendList)
 	{
 		Backend*bp = dlist_container(Backend, elem, iter.cur);
 
-		if (bp->dead_end)
+		if (bp->dead_end || bp->bkend_type == BACKEND_TYPE_BGWORKER)
 			continue;
 
 		/*
 		 * Since target == BACKEND_TYPE_ALL is the most common case, we test
 		 * it first and avoid touching shared memory for every child.
 		 */
-		if (target != BACKEND_TYPE_ALL)
+		if (i

Re: Tid scan improvements

2019-03-18 Thread David Rowley
On Fri, 15 Mar 2019 at 18:42, Edmund Horner  wrote:
> I've had to adapt it to use the table scan API.  I've got it compiling
> and passing tests, but I'm uneasy about some things that still use the
> heapam API.
>
> 1. I call heap_setscanlimits as I'm not sure there is a tableam equivalent.
> 2. I'm not sure whether non-heap tableam implementations can also be
> supported by my TID Range Scan: we need to be able to set the scan
> limits.  There may not be any other implementations yet, but when
> there are, how do we stop the planner using a TID Range Scan for
> non-heap relations?
> 3. When fetching tuples, I see that nodeSeqscan.c uses
> table_scan_getnextslot, which saves dealing with HeapTuples.  But
> nodeTidrangescan wants to do some checking of the block and offset
> before returning the slot.  So I have it using heap_getnext and
> ExecStoreBufferHeapTuple.  Apart from being heapam-specific, it's just
> not as clean as the new API calls.
>
> Ideally, we can get to to support general tableam implementations
> rather than using heapam-specific calls.  Any advice on how to do
> this?

The commit message in 8586bf7ed mentions:

> Subsequent commits will incrementally abstract table access
> functionality to be routed through table access methods. That change
> is too large to be reviewed & committed at once, so it'll be done
> incrementally.

and looking at [1] I see patch 0004 introduces some changes in
nodeTidscan.c to call a new tableam API function named
heapam_fetch_row_version. I see this function does have an ItemPointer
argument, so I guess we must be keeping those as unique row
identifiers in the API.

Patch 0001 does change the signature of heap_setscanlimits() (that appears
to be committed already), and then in 0010 the only code that calls
heap_setscanlimits() (IndexBuildHeapRangeScan()) is moved and renamed to
heapam_index_build_range_scan() and set to be called via the
index_build_range_scan TableAmRoutine method.  So it looks like nothing in
that patch series lets you access heap_setscanlimits() directly via the
TableAmRoutine API; perhaps for this to work heap_setscanlimits will need to
be interfaced, however, I'm unsure if that'll violate any assumptions that
Andres wants to keep out of the API...  Andres?

[1] 
https://www.postgresql.org/message-id/20190311193746.hhv4e4e62nxtq...@alap3.anarazel.de
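
For what it's worth, here's a purely hypothetical sketch of what interfacing
heap_setscanlimits through the table AM might look like; no such callback
exists today, and the struct and function names below are made up for
illustration only:

#include "postgres.h"
#include "access/relscan.h"
#include "storage/block.h"

/* Hypothetical extra callback; not part of the real TableAmRoutine. */
typedef struct TableAmRoutineSketch
{
	/* ... the existing callbacks would go here ... */

	/* Restrict a scan to a block range; NULL if the AM can't do this. */
	void		(*scan_setlimits) (TableScanDesc scan,
								   BlockNumber startBlk,
								   BlockNumber numBlks);
} TableAmRoutineSketch;

/*
 * The planner for a TID Range Scan could then refuse non-heap AMs simply
 * by checking for the callback.
 */
static inline bool
table_scan_setlimits_supported_sketch(const TableAmRoutineSketch *am)
{
	return am->scan_setlimits != NULL;
}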

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Problem with default partition pruning

2019-03-18 Thread Amit Langote
Hosoya-san,

On 2019/03/15 15:05, Yuzuko Hosoya wrote:
> Indeed, it's problematic.  I also tested this and found that the
> problem occurred when no partition matched the
> WHERE clauses.  So the following query didn't work correctly.
> 
> # explain select * from test1_3 where (id > 0 and id < 30);
>QUERY PLAN   
> -
>  Append  (cost=0.00..58.16 rows=12 width=36)
>->  Seq Scan on test1_3_1  (cost=0.00..29.05 rows=6 width=36)
>  Filter: ((id > 0) AND (id < 30))
>->  Seq Scan on test1_3_2  (cost=0.00..29.05 rows=6 width=36)
>  Filter: ((id > 0) AND (id < 30))
> (5 rows)
> 
> I created a new patch to handle this problem, and confirmed
> the query you mentioned works as expected
> 
> # explain select * from test1 where (id > 0 and id < 30) or (id > 220 and id 
> < 230);
> QUERY PLAN 
> ---
>  Append  (cost=0.00..70.93 rows=26 width=36)
>->  Seq Scan on test1_1_1  (cost=0.00..35.40 rows=13 width=36)
>  Filter: (((id > 0) AND (id < 30)) OR ((id > 220) AND (id < 230)))
>->  Seq Scan on test1_3_1  (cost=0.00..35.40 rows=13 width=36)
>  Filter: (((id > 0) AND (id < 30)) OR ((id > 220) AND (id < 230)))
> (5 rows)
> 
> v2 patch attached.
> Could you please check it again?

I think the updated patch breaks the promise that
get_matching_range_bounds won't set scan_default based on individual
pruning value comparisons.  How about the attached delta patch that
applies on top of your earlier v1 patch, which fixes the issue reported by
Thibaut?

Thanks,
Amit
diff --git a/src/backend/partitioning/partprune.c b/src/backend/partitioning/partprune.c
index 8317318156..e64fd1cea6 100644
--- a/src/backend/partitioning/partprune.c
+++ b/src/backend/partitioning/partprune.c
@@ -2759,15 +2759,19 @@ get_matching_range_bounds(PartitionPruneContext *context,
 					 * instead as the offset of the upper bound of the greatest
 					 * partition that may contain lookup value.  If the lookup
 					 * value had exactly matched the bound, but it isn't
-					 * inclusive, no need add the adjacent partition.  If 'off' is
-					 * -1 indicating that all bounds are greater, then we simply
-					 * end up adding the first bound's offset, that is, 0.
+					 * inclusive, no need add the adjacent partition.
 					 */
-					else if (off < 0 || !is_equal || inclusive)
+					else if (!is_equal || inclusive)
 						maxoff = off + 1;
 					else
 						maxoff = off;
 				}
+				else
+					/*
+					 * 'off' is -1 indicating that all bounds are greater, so just
+					 * set the first bound's offset as maxoff.
+					 */
+					maxoff = off + 1;
 				break;
 
 			default:


Re: speeding up planning with partitions

2019-03-18 Thread Amit Langote
Imai-san,

On 2019/03/15 14:40, Imai, Yoshikazu wrote:
> Amit-san,
> 
> I have another little comments about v31-patches.
> 
> * We don't need is_first_child in inheritance_planner().
> 
> On Fri, Mar 8, 2019 at 9:18 AM, Amit Langote wrote:
>> On 2019/03/08 16:16, Imai, Yoshikazu wrote:
>>> I attached the diff of modification for v26-0003 patch which also
>> contains some refactoring.
>>> Please see if it is ok.
>>
>> I like the is_first_child variable which somewhat improves readability,
>> so updated the patch to use it.
> 
> I noticed that detecting first child query in inheritance_planner() is 
> already done by "final_rtable == NIL"
> 
> /*
>  * If this is the first non-excluded child, its post-planning rtable
>  * becomes the initial contents of final_rtable; otherwise, append
>  * just its modified subquery RTEs to final_rtable.
>  */
> if (final_rtable == NIL)
> 
> So I think we can use that instead of using is_first_child.

That's a good suggestion.  One problem I see with that is that
final_rtable is set only once we've found the first *non-dummy* child.
So, if all the children except the last one are dummy, we'd end up never
reusing the source child objects.  Now, if final_rtable was set for the
first child irrespective of whether it turned out to be dummy or not,
which it seems OK to do, then we can implement your suggestion.
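As a sketch of that idea (illustrative only, reusing the variable names from
inheritance_planner()), the first child, dummy or not, would seed
final_rtable, so that the NIL test doubles as the first-child test:

	if (final_rtable == NIL)
		final_rtable = subroot->parse->rtable;	/* first child, dummy or not */
	else
		final_rtable = list_concat(final_rtable,
								   list_copy_tail(subroot->parse->rtable,
												  list_length(final_rtable)));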

Thanks,
Amit




Re: [HACKERS] Block level parallel vacuum

2019-03-18 Thread Kyotaro HORIGUCHI
Hello.

At Mon, 18 Mar 2019 11:54:42 +0900, Masahiko Sawada  
wrote in 
> Here is the performance test results. I've setup a 500MB table with
> several indexes and made 10% of table dirty before each vacuum.
> Compared execution time of the patched postgres with the current HEAD
> (at 'speed_up' column). In my environment,
> 
>  indexes | parallel_degree |  patched   |head| speed_up
> -+-+++--
>   0 |   0 |   238.2085 |   244.7625 |   1.0275
>   0 |   1 |   237.7050 |   244.7625 |   1.0297
>   0 |   2 |   238.0390 |   244.7625 |   1.0282
>   0 |   4 |   238.1045 |   244.7625 |   1.0280
>   0 |   8 |   237.8995 |   244.7625 |   1.0288
>   0 |  16 |   237.7775 |   244.7625 |   1.0294
>   1 |   0 |  1328.8590 |  1334.9125 |   1.0046
>   1 |   1 |  1325.9140 |  1334.9125 |   1.0068
>   1 |   2 |  1333.3665 |  1334.9125 |   1.0012
>   1 |   4 |  1329.5205 |  1334.9125 |   1.0041
>   1 |   8 |  1334.2255 |  1334.9125 |   1.0005
>   1 |  16 |  1335.1510 |  1334.9125 |   0.9998
>   2 |   0 |  2426.2905 |  2427.5165 |   1.0005
>   2 |   1 |  1416.0595 |  2427.5165 |   1.7143
>   2 |   2 |  1411.6270 |  2427.5165 |   1.7197
>   2 |   4 |  1411.6490 |  2427.5165 |   1.7196
>   2 |   8 |  1410.1750 |  2427.5165 |   1.7214
>   2 |  16 |  1413.4985 |  2427.5165 |   1.7174
>   4 |   0 |  4622.5060 |  4619.0340 |   0.9992
>   4 |   1 |  2536.8435 |  4619.0340 |   1.8208
>   4 |   2 |  2548.3615 |  4619.0340 |   1.8126
>   4 |   4 |  1467.9655 |  4619.0340 |   3.1466
>   4 |   8 |  1486.3155 |  4619.0340 |   3.1077
>   4 |  16 |  1481.7150 |  4619.0340 |   3.1174
>   8 |   0 |  9039.3810 |  8990.4735 |   0.9946
>   8 |   1 |  4807.5880 |  8990.4735 |   1.8701
>   8 |   2 |  3786.7620 |  8990.4735 |   2.3742
>   8 |   4 |  2924.2205 |  8990.4735 |   3.0745
>   8 |   8 |  2684.2545 |  8990.4735 |   3.3493
>   8 |  16 |  2672.9800 |  8990.4735 |   3.3635
>  16 |   0 | 17821.4715 | 17740.1300 |   0.9954
>  16 |   1 |  9318.3810 | 17740.1300 |   1.9038
>  16 |   2 |  7260.6315 | 17740.1300 |   2.4433
>  16 |   4 |  5538.5225 | 17740.1300 |   3.2030
>  16 |   8 |  5368.5255 | 17740.1300 |   3.3045
>  16 |  16 |  5291.8510 | 17740.1300 |   3.3523
> (36 rows)

For indexes=4,8,16, the cases with parallel_degree=4,8,16 behave almost the
same. I suspect that the indexes are too small, so all the index pages were
in memory and the CPU was saturated. Maybe you had four cores, so parallel
workers beyond that number had no effect. Other normal backends would have
been able to do almost nothing in the meantime. Usually the number of
parallel workers is chosen so that I/O capacity is filled up, but under such
a situation this feature intermittently saturates CPU capacity instead.

I'm not sure, but what if we did the index vacuum in a tuple-by-tuple
manner? That is, the heap vacuum passes dead tuples one by one (or buffers a
few tuples) to the workers, and the workers process them not with bulkdelete
but with a plain per-tuple delete (which we don't have). That could avoid
the heap scan sleeping while the index bulkdelete runs.


> Attached the updated version patches. The patches apply to the current
> HEAD cleanly but the 0001 patch still changes the vacuum option to a
> Node since it's under the discussion. After the direction has been
> decided, I'll update the patches.

As for the to-be-or-not-to-be-a-node problem, I don't think it is needed,
but from the point of view of consistency it seems reasonable, and it is
seen in other nodes that a *Stmt node holds an options node. But makeVacOpt,
its usage, and the subsequent operations on the node look somewhat strange.
Why don't you just do "makeNode(VacuumOptions)"?


>+  /* Estimate size for dead tuples -- PARALLEL_VACUUM_KEY_DEAD_TUPLES */
>+maxtuples = compute_max_dead_tuples(nblocks, nindexes > 0);

If I understand this correctly, nindexes is always > 1 there. At the least,
it should be asserted to be > 0 there.

>+  estdt = MAXALIGN(add_size(sizeof(LVDeadTuples),

I don't think the name is good. (At first glance, "dt" read as "detach" to me.)

>+if (lps->nworkers_requested > 0)
>+appendStringInfo(&buf,
>+ ngettext("launched %d parallel vacuum worker for 
>index cleanup (planned: %d, requested %d)",

"planned"?


>+/* Get the next index to vacuum */
>+if (do_parallel)
>+idx = pg_atomic_fetch_add_u32(&(lps->lvshared->nprocessed), 1);
>+else
>+ 

Re: pg_basebackup ignores the existing data directory permissions

2019-03-18 Thread Kyotaro HORIGUCHI
At Mon, 18 Mar 2019 17:16:01 +0900, Michael Paquier  wrote 
in <20190318081601.gi1...@paquier.xyz>
> On Mon, Mar 18, 2019 at 08:32:44AM +0100, Magnus Hagander wrote:
> > On Mon, Mar 18, 2019 at 7:08 AM Stephen Frost  wrote:
> >> I definitely think that we should add an option to allow the user to
> >> tell us explicitly what they want here, even if we also go based on what
> >> the created directory has (and in that case, we should make everything,
> >> including the base directory, follow what the user asked for).
> > 
> > +1 for having an option to override whatever the default is.
> 
> Based on the feedback gathered, having a separate option to enforce
> the default and not touching the behavior implemented until now,
> sounds fine to me.

FWIW +1 from me. That would be like cp -p.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: chained transactions

2019-03-18 Thread Peter Eisentraut
Updated patch.  I have squashed the two previously separate patches
together in this one.

On 2019-01-06 15:14, Fabien COELHO wrote:
> I do not understand the value of the SAVEPOINT in the tests.

The purpose of the SAVEPOINT in the test is because it exercises
different switch cases in CommitTransactionCommand() and
AbortCurrentTransaction().  It's not entirely comprehensible from the
outside, but code coverage analysis confirms it.

> Otherwise I'm okay with this patch.
> 
> About the second patch, I'm still unhappy with functions named commit & 
> rollback doing something else, which results in somewhat strange code, where 
> you have to guess that the transaction is restarted in all cases, either 
> within the commit function or explicitly.

I have updated the SPI interface with your suggestions.  I agree it's
better that way.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From 87a1c30fedddeb83f7a84a0c8f7d012a0df43814 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Mon, 18 Mar 2019 10:52:12 +0100
Subject: [PATCH v5] Transaction chaining

Add command variants COMMIT AND CHAIN and ROLLBACK AND CHAIN, which
start new transactions with the same transaction characteristics as the
just finished one, per SQL standard.

Support for transaction chaining in PL/pgSQL is also added.

Discussion: 
https://www.postgresql.org/message-id/flat/28536681-324b-10dc-ade8-ab46f7645...@2ndquadrant.com
---
 doc/src/sgml/plpgsql.sgml |   9 +
 doc/src/sgml/ref/abort.sgml   |  14 +-
 doc/src/sgml/ref/commit.sgml  |  19 +-
 doc/src/sgml/ref/end.sgml |  14 +-
 doc/src/sgml/ref/rollback.sgml|  19 +-
 doc/src/sgml/spi.sgml |  25 ++-
 src/backend/access/transam/xact.c |  73 ++-
 src/backend/catalog/sql_features.txt  |   2 +-
 src/backend/executor/spi.c|  50 -
 src/backend/nodes/copyfuncs.c |   1 +
 src/backend/nodes/equalfuncs.c|   1 +
 src/backend/parser/gram.y |  19 +-
 src/backend/tcop/utility.c|   4 +-
 src/bin/psql/tab-complete.c   |   8 +-
 src/include/access/xact.h |   6 +-
 src/include/executor/spi.h|   2 +
 src/include/nodes/parsenodes.h|   1 +
 .../src/expected/plpgsql_transaction.out  |  28 +++
 src/pl/plpgsql/src/pl_exec.c  |  18 +-
 src/pl/plpgsql/src/pl_funcs.c |  10 +-
 src/pl/plpgsql/src/pl_gram.y  |  18 +-
 src/pl/plpgsql/src/pl_unreserved_kwlist.h |   2 +
 src/pl/plpgsql/src/plpgsql.h  |   2 +
 .../plpgsql/src/sql/plpgsql_transaction.sql   |  23 +++
 src/test/regress/expected/transactions.out| 191 ++
 src/test/regress/sql/transactions.sql |  63 ++
 26 files changed, 584 insertions(+), 38 deletions(-)

diff --git a/doc/src/sgml/plpgsql.sgml b/doc/src/sgml/plpgsql.sgml
index f8c6435c50..eacd67d8b6 100644
--- a/doc/src/sgml/plpgsql.sgml
+++ b/doc/src/sgml/plpgsql.sgml
@@ -3490,6 +3490,15 @@ Transaction Management
 

 
+   
+A new transaction starts out with default transaction characteristics such
+as transaction isolation level.  In cases where transactions are committed
+in a loop, it might be desirable to start new transactions automatically
+with the same characteristics as the previous one.  The commands
+COMMIT AND CHAIN and ROLLBACK AND
+CHAIN accomplish this.
+   
+

 Transaction control is only possible in CALL or
 DO invocations from the top level or nested
diff --git a/doc/src/sgml/ref/abort.sgml b/doc/src/sgml/ref/abort.sgml
index 21799d2a83..0372913365 100644
--- a/doc/src/sgml/ref/abort.sgml
+++ b/doc/src/sgml/ref/abort.sgml
@@ -21,7 +21,7 @@
 
  
 
-ABORT [ WORK | TRANSACTION ]
+ABORT [ WORK | TRANSACTION ] [ AND [ NO ] CHAIN ]
 
  
 
@@ -51,6 +51,18 @@ Parameters
  
 

+
+   
+AND CHAIN
+
+ 
+  If AND CHAIN is specified, a new transaction is
+  immediately started with the same transaction characteristics (see ) as the just finished one.  Otherwise,
+  no new transaction is started.
+ 
+
+   
   
  
 
diff --git a/doc/src/sgml/ref/commit.sgml b/doc/src/sgml/ref/commit.sgml
index b2e8d5d180..96a018e6aa 100644
--- a/doc/src/sgml/ref/commit.sgml
+++ b/doc/src/sgml/ref/commit.sgml
@@ -21,7 +21,7 @@
 
  
 
-COMMIT [ WORK | TRANSACTION ]
+COMMIT [ WORK | TRANSACTION ] [ AND [ NO ] CHAIN ]
 
  
 
@@ -48,6 +48,18 @@ Parameters
  
 

+
+   
+AND CHAIN
+
+ 
+  If AND CHAIN is specified, a new transaction is
+  immediately started with the same transaction characteristics (see ) as the just finished one.  Otherwise,
+  no new transaction is started.
+ 
+
+   
   
  
 
@@ -79,9 +91,8 @@ Examples
   

Re: chained transactions

2019-03-18 Thread Peter Eisentraut
On 2019-02-16 06:22, Andres Freund wrote:
>> +static int  save_XactIsoLevel;
>> +static bool save_XactReadOnly;
>> +static bool save_XactDeferrable;
> 
> We normally don't define variables in the middle of a file?  Also, why
> do these need to be global vars rather than defined where we do
> chaining? I'm imagining a SavedTransactionState struct declared on the
> stack that's then passed to Save/Restore?
> 
> Not crucial, but I do wonder if we can come up with a prettier approach
> for this.

If we do it with a struct that is passed to the functions, then either
the struct contents will be available outside of xact.c, which will
expose internals of xact.c that weren't previously exposed, or we do it
with an incomplete struct, but then this would involve palloc across
transaction commit, resulting in additional complexity.  Neither of
these sound like obviously better alternatives.
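For reference, a rough sketch of the struct-based alternative being discussed
might look like this; the names are assumptions for illustration, not code
from the patch:

	/* caller-allocated state capturing the characteristics to restore */
	typedef struct SavedTransactionCharacteristics
	{
		int		save_XactIsoLevel;
		bool	save_XactReadOnly;
		bool	save_XactDeferrable;
	} SavedTransactionCharacteristics;

	extern void SaveTransactionCharacteristics(SavedTransactionCharacteristics *s);
	extern void RestoreTransactionCharacteristics(const SavedTransactionCharacteristics *s);

Declaring the struct completely in xact.h exposes those internals; keeping it
incomplete forces an allocation across commit, which is the trade-off
described above.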

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: explain plans with information about (modified) gucs

2019-03-18 Thread Rafia Sabih
On Sun, 24 Feb 2019 at 00:06, Tomas Vondra  wrote:
>
> Hi,
>
> attached is an updated patch, fixing and slightly tweaking the docs.
>
>
> Barring objections, I'll get this committed later next week.
>
I was having a look at this patch, and this kept me wondering,

+static void
+ExplainShowSettings(ExplainState *es)
+{
Is there some reason for not providing any explanation above this
function just like the rest of the functions in this file?

Similarly, for

struct config_generic **
get_explain_guc_options(int *num)
{

/* also bail out of there are no options */
+ if (!num)
+ return;
I think you meant 'if' instead of 'of' here.



Re: [HACKERS] CLUSTER command progress monitor

2019-03-18 Thread Rafia Sabih
On Fri, 8 Mar 2019 at 09:14, Tatsuro Yamada
 wrote:
>
> On 2019/03/06 15:38, Tatsuro Yamada wrote:
> > On 2019/03/05 17:56, Tatsuro Yamada wrote:
> >> On 2019/03/05 11:35, Robert Haas wrote:
> >>> On Mon, Mar 4, 2019 at 5:38 AM Tatsuro Yamada
> >>>  wrote:
>  === Current design ===
> 
>  CLUSTER command uses Index Scan or Seq Scan when scanning the heap.
>  Depending on which one is chosen, the command will proceed in the
>  following sequence of phases:
> 
>  * Scan method: Seq Scan
>    0. initializing (*2)
>    1. seq scanning heap(*1)
>    3. sorting tuples   (*2)
>    4. writing new heap (*1)
>    5. swapping relation files  (*2)
>    6. rebuilding index (*2)
>    7. performing final cleanup (*2)
> 
>  * Scan method: Index Scan
>    0. initializing (*2)
>    2. index scanning heap  (*1)
>    5. swapping relation files  (*2)
>    6. rebuilding index (*2)
>    7. performing final cleanup (*2)
> 
>  VACUUM FULL command will proceed in the following sequence of phases:
> 
>    1. seq scanning heap(*1)
>    5. swapping relation files  (*2)
>    6. rebuilding index (*2)
>    7. performing final cleanup (*2)
> 
>  (*1): increasing the value in heap_tuples_scanned column
>  (*2): only shows the phase in the phase column
> >>>
> >>> All of that sounds good.
> >>>
>  The view provides the information of CLUSTER command progress details as 
>  follows
>  # \d pg_stat_progress_cluster
>  View "pg_catalog.pg_stat_progress_cluster"
>  Column   |  Type   | Collation | Nullable | Default
>  ---+-+---+--+-
> pid   | integer |   |  |
> datid | oid |   |  |
> datname   | name|   |  |
> relid | oid |   |  |
> command   | text|   |  |
> phase | text|   |  |
> cluster_index_relid   | bigint  |   |  |
> heap_tuples_scanned   | bigint  |   |  |
> heap_tuples_vacuumed  | bigint  |   |  |
> >>>
> >>> Still not sure if we need heap_tuples_vacuumed.  We could try to
> >>> report heap_blks_scanned and heap_blks_total like we do for VACUUM, if
> >>> we're using a Seq Scan.
> >>
> >> I have no strong opinion to add heap_tuples_vacuumed, so I'll remove that 
> >> in
> >> next patch.
> >>
> >> Regarding heap_blks_scanned and heap_blks_total, I suppose that it is able 
> >> to
> >> get those from initscan(). I'll investigate it more.
> >>
> >> cluster.c
> >>copy_heap_data()
> >>  heap_beginscan()
> >>heap_beginscan_internal()
> >>  initscan()
> >>
> >>
> >>
>  === Discussion points ===
> 
> - Progress counter for "3. sorting tuples" phase
>    - Should we add pgstat_progress_update_param() in tuplesort.c like 
>  a
>  "trace_sort"?
>  Thanks to Peter Geoghegan for the useful advice!
> >>>
> >>> How would we avoid an abstraction violation?
> >>
> >> Hmm... What do you mean an abstraction violation?
> >> If it is difficult to solve, I'd not like to add the progress counter for 
> >> the sorting tuples.
> >>
> >>
> - Progress counter for "6. rebuilding index" phase
>    - Should we add "index_vacuum_count" in the view like a vacuum 
>  progress monitor?
>  If yes, I'll add pgstat_progress_update_param() to 
>  reindex_relation() of index.c.
>  However, I'm not sure whether it is okay or not.
> >>>
> >>> Doesn't seem unreasonable to me.
> >>
> >> I see, I'll add it later.
> >
> >
> > Attached file is revised and WIP patch including:
> >
> >- Remove heap_tuples_vacuumed
> >- Add heap_blks_scanned and heap_blks_total
> >- Add index_vacuum_count
> >
> > I tried to "add heap_blks_scanned and heap_blks_total" columns and I 
> > realized that
> > "heap_tuples_scanned" column is suitable as a counter when a scan method is
> > both index-scan and seq-scan because CLUSTER is on a tuple basis.
>
>
> Attached file is rebased patch on current HEAD.
> I changed a status. :)
>
>
Looks like the patch needs a rebase.
I was on the commit fb5806533f9fe0433290d84c9b019399cd69e9c2

PFA reject file in case you want to have a look.
> Regards,
> Tatsuro Yamada
>
>
>


-- 
Regards,
Rafia Sabih
--- src/backend/commands/cluster.c
+++ src/backend/commands/cluster.c
@@ -942,14 +960,33 @@ copy_heap_data(Oid OIDNewHeap, Oid OIDOld

Re: Add exclusive backup deprecation notes to documentation

2019-03-18 Thread Peter Eisentraut
On 2019-03-07 10:33, David Steele wrote:
> On 3/1/19 3:14 PM, Laurenz Albe wrote:
>> Magnus Hagander wrote:
>>> Maybe have the first note say "This method is deprecated because it has
>>> serious risks (see below)" and then list the actual risks at the end?
>>
>> Good idea.  That may attract the attention of the dogs among the readers.
> 
> OK, here's a new version that splits the deprecation notes from the 
> discussion of risks.  I also fixed the indentation.

The documentation changes appear to continue the theme from the other
thread that the exclusive backup mode is terrible and everyone should
feel bad about it.  I don't think there is consensus about that.

I do welcome a more precise description of the handling of backup_label
and a better hint in the error message.  I think we haven't gotten to
the final shape there yet, especially for the latter.  I suggest to
focus on that.

The other changes repeat points already made in nearby documentation.

I think it would be helpful to frame the documentation in a way to
suggest that the nonexclusive mode is more for automation and more
sophisticated tools and the exclusive mode is more for manual or simple
scripted use.

If we do think that the exclusive mode will be removed in PG13, then I
don't think we need further documentation changes.  It already says it's
deprecated, and we don't need to justify that at length.  But again, I'm
not convinced that that will happen.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: pg_basebackup ignores the existing data directory permissions

2019-03-18 Thread Peter Eisentraut
On 2019-03-16 15:29, Robert Haas wrote:
> Another option would be to provide a pg_basebackup option to allow the
> user to specify what they intended i.e.  --[no-]group-read.  (Tying it
> to -R doesn't sound like a good decision to me.)

I was actually surprised to learn how it works right now.  I would have
preferred that pg_basebackup require an explicit option to turn on group
access, similar to initdb.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: pg_basebackup ignores the existing data directory permissions

2019-03-18 Thread Peter Eisentraut
On 2019-03-18 08:32, Magnus Hagander wrote:
> Going based on the current setting of the directory seems defensible to
> me, with the argument of "we trust you created the directory the way you
> want the rest of the system to be".
> 
> 
> Which I believe is also how a plain unix cp (or tar or whatever) would
> work, isn't it? I think that alone is a pretty strong reason to work the
> same as those -- they're not entirely unsimilar.

Those don't copy over the network.  In the case of pg_basebackup, there
is nothing that ensures that the remote system has the same users or
groups or permission setup.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Unaccent extension python script Issue in Windows

2019-03-18 Thread Hugh Ranalli
On Mon, 18 Mar 2019 at 01:14, Kyotaro HORIGUCHI <
horiguchi.kyot...@lab.ntt.co.jp> wrote:

> This patch contains irrelevant changes. The minimal required
> change would be the attached. If you want refacotor the
> UnicodeData reader or rearrange import sutff, it should be
> separate patches.
>
I'm not sure I'd classify the second change as "irrelevant." Using "with"
is the standard and recommended practice for working with files in Python.
At the moment the script does nothing to close the open data file, whether
through regular processing or in the case of an exception. I would argue
that's a bug and should be fixed. Creating a separate patch for that seems
to be adding work for no reason.

Hugh


Re: pg_basebackup ignores the existing data directory permissions

2019-03-18 Thread Robert Haas
On Mon, Mar 18, 2019 at 4:16 AM Michael Paquier  wrote:
> On Mon, Mar 18, 2019 at 08:32:44AM +0100, Magnus Hagander wrote:
> > On Mon, Mar 18, 2019 at 7:08 AM Stephen Frost  wrote:
> >> I definitely think that we should add an option to allow the user to
> >> tell us explicitly what they want here, even if we also go based on what
> >> the created directory has (and in that case, we should make everything,
> >> including the base directory, follow what the user asked for).
> >
> > +1 for having an option to override whatever the default is.
>
> Based on the feedback gathered, having a separate option to enforce
> the default and not touching the behavior implemented until now,
> sounds fine to me.

That's not what I'm proposing.  I think the behavior implemented until
now is not best, because the files within the directory should inherit
the directory's permissions, not the remote side's permissions.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: partitioned tables referenced by FKs

2019-03-18 Thread Alvaro Herrera
Hi Amit

On 2019-Mar-18, Amit Langote wrote:

> On 2019/03/15 2:31, Alvaro Herrera wrote:
> > Once I was finished, fixed bugs and tested it, I realized that that was
> > a stupid thing to have done -- because THOSE ARE DIFFERENT CONSTRAINTS.
> > When you say "fk (a) references pk1" you're saying that all the values
> > in fk(a) must appear in pk1.  OTOH when you say "fk references pk" you
> > mean that the values might appear anywhere in pk, not just pk1.
> 
> Sure, then just drop the check trigger that queries only pk1, in favor of
> one that checks pk, that's already in place due to the parent constraint.
> Actually, if one doesn't drop it (that is, by way of dropping the
> constraint that created it), they won't be able to insert into fk anything
> but the subset of rows that pk1 allows as a partition;

That's true, and it is the correct behavior.  What they should be doing
(to prevent the insertion into the referencing relation from failing all
the time) is drop the constraint, as you suggest.

Converting the constraint so that it refers to a completely different
set of permitted values is second-guessing the user about what they
wanted to do; what if we get it wrong, and they really wanted the FK to
reference just exactly the subset of values that they specified, and not
a much larger superset of that?

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Possible to modify query language in an extension?

2019-03-18 Thread Robert Haas
On Sun, Mar 17, 2019 at 12:21 AM Tom Lane  wrote:
> Chris Cleveland  writes:
> > I'd like to add some keywords/clauses to the SELECT statement.
>
> Yeah, you'll have to modify gram.y (and a pile of other places)
> if you want to do that.  That's certainly something we do all
> the time, but bison doesn't provide any way to add grammar
> productions on-the-fly, so it does imply core-code mods.
>
> > ... The new SELECT would return multiple result sets.
>
> And that sounds like you'd also be redefining the wire protocol,
> hence having to touch client-side code as well as the server.

Long story short, this sounds like a VERY hard project.  Chris, you
will probably want to think about some other approach to achieving
your objective, because this sounds like a project that even an expert
coder would spend a lot of time trying to get done.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Timeout parameters

2019-03-18 Thread Robert Haas
On Sun, Mar 17, 2019 at 9:08 PM Jamison, Kirk  wrote:
> The main argument here is about the security risk of allowing socket timeout
> to cancel valid connections, right?

I don't think so.  I think it's just a weirdly-designed parameter
without a really compelling use case.  Enforcing limits on the value
of the parameter doesn't fix that.  Most of the reviewers who have
opined so far have been somewhere between cautious and negative about
the value of that parameter, so I think we should just not add it.  At
least for now.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Compressed TOAST Slicing

2019-03-18 Thread Tom Lane
Stephen Frost  writes:
> * Andres Freund (and...@anarazel.de) wrote:
>> I don't think that should stop us from breaking the API. You've got to
>> do quite low level stuff to need pglz directly, in which case such an
>> API change should be the least of your problems between major versions.

> Agreed, this is across a major version and I don't think it's an issue
> to break the API.

Yeah.  We don't normally hesitate to change internal APIs across major
versions, as long as
(a) the breakage will be obvious when recompiling an extension, and
(b) it will be clear how to get the same behavior as before.

Adding an argument qualifies on both counts.  Sometimes, if a very
large number of call sites would be affected, it makes sense to use
a wrapper function so that we don't have to touch so many places;
but that doesn't apply here.

regards, tom lane



Re: [HACKERS] generated columns

2019-03-18 Thread Pavel Stehule
Hi

On Mon, 18 Mar 2019 at 8:35, Peter Eisentraut <
peter.eisentr...@2ndquadrant.com> wrote:

> Here is an updated patch with just the "stored" functionality, as
> discussed.
>
> The actual functionality is much smaller now, contained in the executor.
>  Everything else is mostly DDL support, trigger handling, and some
> frontend stuff.
>

I think I found a bug

create table foo(id int, name text);
insert into foo values(1, 'aaa');
alter table foo add column name_upper text generated always as
(upper(name)) stored;
update foo set name = 'bbb' where id = 1; -- ok

alter table foo drop column name_upper;
update foo set name = 'bbbx' where id = 1; -- ok

alter table foo add column name_upper text generated always as
(upper(name)) stored;
update foo set name = 'bbbxx' where id = 1; -- error

postgres=# update foo set name = 'bbbxx' where id = 1; -- error
ERROR:  no generation expression found for column number 3 of table "foo"

Regards

Pavel








> --
> Peter Eisentraut  http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>


Re: Compressed TOAST Slicing

2019-03-18 Thread Robert Haas
On Mon, Mar 18, 2019 at 10:14 AM Tom Lane  wrote:
> Stephen Frost  writes:
> > * Andres Freund (and...@anarazel.de) wrote:
> >> I don't think that should stop us from breaking the API. You've got to
> >> do quite low level stuff to need pglz directly, in which case such an
> >> API change should be the least of your problems between major versions.
>
> > Agreed, this is across a major version and I don't think it's an issue
> > to break the API.
>
> Yeah.  We don't normally hesitate to change internal APIs across major
> versions, as long as
> (a) the breakage will be obvious when recompiling an extension, and
> (b) it will be clear how to get the same behavior as before.
>
> Adding an argument qualifies on both counts.  Sometimes, if a very
> large number of call sites would be affected, it makes sense to use
> a wrapper function so that we don't have to touch so many places;
> but that doesn't apply here.

+1.  I think Paul had it right originally.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Unduly short fuse in RequestCheckpoint

2019-03-18 Thread Robert Haas
On Sun, Mar 17, 2019 at 3:41 PM Tom Lane  wrote:
>  So I think we should just blow off that
> hypothetical possibility and do it like this.

Makes sense to me.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Compressed TOAST Slicing

2019-03-18 Thread Paul Ramsey


> On Mar 18, 2019, at 7:34 AM, Robert Haas  wrote:
> 
> On Mon, Mar 18, 2019 at 10:14 AM Tom Lane  wrote:
>> Stephen Frost  writes:
>>> * Andres Freund (and...@anarazel.de) wrote:
 I don't think that should stop us from breaking the API. You've got to
 do quite low level stuff to need pglz directly, in which case such an
 API change should be the least of your problems between major versions.
>> 
>>> Agreed, this is across a major version and I don't think it's an issue
>>> to break the API.
>> 
>> Yeah.  We don't normally hesitate to change internal APIs across major
>> versions, as long as
>> (a) the breakage will be obvious when recompiling an extension, and
>> (b) it will be clear how to get the same behavior as before.
>> 
>> Adding an argument qualifies on both counts.  Sometimes, if a very
>> large number of call sites would be affected, it makes sense to use
>> a wrapper function so that we don't have to touch so many places;
>> but that doesn't apply here.
> 
> +1.  I think Paul had it right originally.

In that spirit, here is a “one pglz_decompress function, new parameter” version 
for commit.
Thanks!
P


compressed-datum-slicing-20190318a.patch
Description: Binary data
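For readers following along, the shape of that change is roughly the
following hedged sketch; the name of the new boolean parameter is an
assumption here, not necessarily the committed signature:

	#include "postgres.h"
	#include "common/pg_lzcompress.h"

	/* Decompress only the first `want` bytes of a compressed datum. */
	static int32
	decompress_prefix(const char *src, int32 srclen, char *dst, int32 want)
	{
		/*
		 * Passing false for the new flag tells pglz_decompress() that the
		 * caller only needs a prefix, so the source need not be consumed
		 * completely.
		 */
		return pglz_decompress(src, srclen, dst, want, false);
	}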


Re: BUG #15641: Autoprewarm worker fails to start on Windows with huge pages in use

2019-03-18 Thread Robert Haas
On Mon, Mar 18, 2019 at 3:04 AM Mithun Cy  wrote:
> autoprewarm waorker should not be restarted. As per the code 
> @apw_start_database_worker@ master starts a worker per database and wait 
> until it exit by calling WaitForBackgroundWorkerShutdown.  The call 
> WaitForBackgroundWorkerShutdown cannot handle the case if the worker was 
> restarted. The WaitForBackgroundWorkerShutdown() get the status BGWH_STOPPED 
> from the call GetBackgroundWorkerPid() if worker got restarted. So master 
> will next detach the shared memory and next restarted worker keep failing 
> going in a unending loop.

Ugh, that seems like a silly oversight.  Does it fix the reported problem?

If I understand correctly, the commit message would be something like this:

==
Don't auto-restart per-database autoprewarm workers.

We should try to prewarm each database only once.  Otherwise, if
prewarming fails for some reason, it will just keep retrying in an
infnite loop.  The existing code was intended to implement this
behavior, but because it neglected to set worker.bgw_restart_time, the
per-database workers keep restarting, contrary to what was intended.

Mithun Cy, per a report from Hans Buschmann
==

Does that sound right?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
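For illustration, a minimal sketch of the shape of the fix being described:
the per-database worker registration explicitly marks the worker as never
restarted.  Everything apart from the bgw_restart_time line (start time, main
argument, function name) is an assumption made for this sketch, not the
actual autoprewarm code.

	#include "postgres.h"

	#include "miscadmin.h"
	#include "postmaster/bgworker.h"

	/* illustrative sketch: a per-database worker that is never restarted */
	static void
	start_database_worker_sketch(Oid dboid)
	{
		BackgroundWorker worker;
		BackgroundWorkerHandle *handle;

		memset(&worker, 0, sizeof(worker));
		worker.bgw_flags = BGWORKER_SHMEM_ACCESS | BGWORKER_BACKEND_DATABASE_CONNECTION;
		worker.bgw_start_time = BgWorkerStart_ConsistentState;
		worker.bgw_restart_time = BGW_NEVER_RESTART;	/* the line the fix adds */
		snprintf(worker.bgw_library_name, BGW_MAXLEN, "pg_prewarm");
		snprintf(worker.bgw_function_name, BGW_MAXLEN, "autoprewarm_database_main");
		worker.bgw_main_arg = ObjectIdGetDatum(dboid);	/* assumption for this sketch */
		worker.bgw_notify_pid = MyProcPid;

		if (!RegisterDynamicBackgroundWorker(&worker, &handle))
			return;

		/*
		 * The master waits for the worker to finish; a restarted worker would
		 * confuse this wait, which is why BGW_NEVER_RESTART matters here.
		 */
		(void) WaitForBackgroundWorkerShutdown(handle);
	}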



Re: pg_basebackup ignores the existing data directory permissions

2019-03-18 Thread Peter Eisentraut
On 2019-03-18 14:47, Robert Haas wrote:
>> Based on the feedback gathered, having a separate option to enforce
>> the default and not touching the behavior implemented until now,
>> sounds fine to me.
> That's not what I'm proposing.  I think the behavior implemented until
> now is not best, because the files within the directory should inherit
> the directory's permissions, not the remote side's permissions.

I'm strongly in favor of keeping initdb and pg_basebackup options
similar and consistent.  They are both ways to initialize data directories.

You'll note that initdb does not behave the way you describe.  It's not
unreasonable behavior, but it's not the way it currently works.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: pg_basebackup ignores the existing data directory permissions

2019-03-18 Thread Robert Haas
On Mon, Mar 18, 2019 at 11:36 AM Peter Eisentraut
 wrote:
> On 2019-03-18 14:47, Robert Haas wrote:
> >> Based on the feedback gathered, having a separate option to enforce
> >> the default and not touching the behavior implemented until now,
> >> sounds fine to me.
> > That's not what I'm proposing.  I think the behavior implemented until
> > now is not best, because the files within the directory should inherit
> > the directory's permissions, not the remote side's permissions.
>
> I'm strongly in favor of keeping initdb and pg_basebackup options
> similar and consistent.  They are both ways to initialize data directories.
>
> You'll note that initdb does not behave the way you describe.  It's not
> unreasonable behavior, but it's not the way it currently works.

So you want to default to no group access regardless of the directory
permissions, with an option to enable group access that must be
explicitly specified?  That seems like a reasonable option to me; note
that initdb does seem to chdir() an existing directory.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Add exclusive backup deprecation notes to documentation

2019-03-18 Thread Robert Haas
On Mon, Mar 18, 2019 at 8:33 AM Peter Eisentraut
 wrote:
> The documentation changes appear to continue the theme from the other
> thread that the exclusive backup mode is terrible and everyone should
> feel bad about it.  I don't think there is consensus about that.

+1.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Custom compression methods

2019-03-18 Thread Ildar Musin
Hi Ildus,

On Fri, Mar 15, 2019 at 12:52 PM Ildus Kurbangaliev <
i.kurbangal...@gmail.com> wrote:

>
> Hi,
> in my opinion this patch is usually skipped not because it is not
> needed, but because of its size. It is not hard to maintain it until
> commiters will have time for it or I will get actual response that
> nobody is going to commit it.
>
> Attached latest set of patches.
>
>
As I understand it, the only thing that has changed since my last review is
an additional compression method for zlib.

The code looks good. I have one suggestion though. Currently you only
predefine two compression levels: `best_speed` and `best_compression`. But
zlib itself allows a finer gradation between those two: it is possible to set
the level to values from 0 to 9 (where zero means no compression at all,
which I guess isn't useful in our case). So I think we should allow the user
to choose either the textual representation (as you already did) or a numeric
value. Another thing is that one can specify, for instance, the `best_speed`
level, but not `BEST_SPEED`, which can be a bit frustrating for the user.

Regards,
Ildar Musin


Concurrency bug with vacuum full (cluster) and toast

2019-03-18 Thread Alexander Korotkov
Hi all,

I've discovered a bug where VACUUM FULL fails with an error because it
couldn't find toast chunks that it deleted itself.  That happens because
cluster_rel() sets OldestXmin, but toast accesses get their snapshot later
and independently.  That causes heap_page_prune_opt() to clean chunks
which rebuild_relation() expects to exist.  This bug happens very rarely,
on busy systems which actively update toast values, but I found a way to
reproduce it reliably using a debugger.

*Setup*

CREATE FUNCTION random_string(seed integer, length integer) RETURNS text
AS $$
SELECT substr(
string_agg(
substr(
encode(
decode(
md5(seed::text || '-' || i::text),
'hex'),
'base64'),
1, 21),
''),
1, length)
FROM generate_series(1, (length + 20) / 21) i; $$
LANGUAGE SQL;

CREATE TABLE test (val text);
INSERT INTO test VALUES (random_string(1,10));

*Reproduction steps*

s1-s3 are three parallel PostgreSQL sessions
s3lldb is lldb attached to s3 (the session that will run VACUUM FULL)

At first s1 acquires snapshot and holds it.

s1# begin transaction isolation level repeatable read;
s1# select 1;

Then s2 makes multiple updates of our toasted value.

s2# update test set val = random_string(2,10);
s2# update test set val = random_string(3,10);
s2# update test set val = random_string(4,10);
s2# update test set val = random_string(5,10);
s2# update test set val = random_string(6,10);
s2# update test set val = random_string(7,10);

Then s3 starts vacuum full, stopping at vacuum_set_xid_limits().

s3lldb# b vacuum_set_xid_limits
s3# vacuum full test;

We step past vacuum_set_xid_limits(), making sure the old tuple versions
made by s2 will be recently dead for vacuum full.

s3lldb# finish

Then s1 releases its snapshot.  Now heap_page_prune_opt(), called from
toast accesses, will clean up the toast chunks which vacuum full expects
to still be recently dead.

s1# commit;

Finally, we continue our vacuum full and get an error!

s3lldb# continue
s3#
ERROR:  unexpected chunk number 50 (expected 2) for toast value 16429
in pg_toast_16387

The attached patch contains a dirty fix for this bug, which just prevents
heap_page_prune_opt() from cleaning tuples when it's called from
rebuild_relation().  It's not something I'm proposing to commit or even
review; it might just be a starting point for thoughts.

Any ideas?

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


cluster_toast_concurrency_fix.patch
Description: Binary data


Re: Fix optimization of foreign-key on update actions

2019-03-18 Thread Peter Eisentraut
On 2019-03-11 12:57, Peter Eisentraut wrote:
> On 2019-02-06 23:15, Peter Eisentraut wrote:
>> On 05/02/2019 17:20, Tom Lane wrote:
>>> What I *don't* like about the proposed patch is that it installs a
>>> new, different comparison rule for the ON UPDATE CASCADE case only.
>>> If we were to go in this direction, I'd think we should try to use
>>> the same comparison rule for all FK row comparisons.
>>
>> That's easy to change.  I had it like that in earlier versions of the
>> patch.  I agree it would be better for consistency, but it would create
>> some cases where we do unnecessary extra work.
> 
> Updated patch with this change made, and some conflicts resolved.

Committed like this.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-18 Thread Peter Geoghegan
On Mon, Mar 18, 2019 at 4:59 AM Heikki Linnakangas  wrote:
> I'm getting a regression failure from the 'create_table' test with this:

> Are you seeing that?

Yes -- though the bug is in your revised v18, not the original v18,
which passed CFTester. Your revision fails on Travis/Linux, which is
pretty close to what I see locally, and much less subtle than the test
failures you mentioned:

https://travis-ci.org/postgresql-cfbot/postgresql/builds/507816665

"make check" did pass locally for me with your patch, but "make
check-world" (parallel recipe) did not.

The original v18 passed both CFTester tests about 15 hour ago:

https://travis-ci.org/postgresql-cfbot/postgresql/builds/507643402

I see the bug. You're not supposed to test this way with a heapkeyspace index:

> +   if (P_RIGHTMOST(lpageop) ||
> +   _bt_compare(rel, itup_key, page, P_HIKEY) != 0)
> +   break;

This is because the presence of scantid makes it almost certain that
you'll break when you shouldn't. You're doing it the old way, which is
inappropriate for a heapkeyspace index. Note that it would probably
take much longer to notice this bug if the "consider secondary
factors" patch was also applied, because then you would rarely have
cause to step right here (duplicates would never occupy more than a
single page in the regression tests). The test failures are probably
also timing sensitive, though they happen very reliably on my machine.

> Looking at the patches 1 and 2 again:
>
> I'm still not totally happy with the program flow and all the conditions
> in _bt_findsplitloc(). I have a hard time following which codepaths are
> taken when. I refactored that, so that there is a separate copy of the
> loop for V3 and V4 indexes.

The big difference is that you make the possible call to
_bt_stepright() conditional on this being a checkingunique index --
the duplicate code is indented in that branch of _bt_findsplitloc().
Whereas I break early in the loop when "checkingunique &&
heapkeyspace".

The flow of the original loop not only had less code. It also
contrasted the important differences between heapkeyspace and
!heapkeyspace cases:

/* If this is the page that the tuple must go on, stop */
if (P_RIGHTMOST(lpageop))
break;
cmpval = _bt_compare(rel, itup_key, page, P_HIKEY);
if (itup_key->heapkeyspace)
{
if (cmpval <= 0)
break;
}
else
{
/*
 * pg_upgrade'd !heapkeyspace index.
 *
 * May have to handle legacy case where there is a choice of which
 * page to place new tuple on, and we must balance space
 * utilization as best we can.  Note that this may invalidate
 * cached bounds for us.
 */
if (cmpval != 0 || _bt_useduplicatepage(rel, heapRel, insertstate))
break;
}

I thought it was obvious that the "cmpval <= 0" code was different for
a reason. It now seems that this at least needs a comment.

I still believe that the best way to handle the !heapkeyspace case is
to make it similar to the heapkeyspace checkingunique case, regardless
of whether or not we're checkingunique. The fact that this bug slipped
in supports that view. Besides, the alternative that you suggest
treats !heapkeyspace indexes as if they were just as important to the
reader, which seems inappropriate (better to make the legacy case
follow the new case, not the other way around). I'm fine with the
comment tweaks that you made that are not related to
_bt_findsplitloc(), though.

I won't push the patches today, to give you the opportunity to
respond. I am not at all convinced right now, though.

--
Peter Geoghegan



Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?

2019-03-18 Thread Yun Li
Thanks a lot for the really good points! I did not expect to get this many
points of view. :P

I have had the same experience as Robert: when another extension calculates
the ID differently from pg_stat_statements (PGSS), PGSS will overwrite that
ID when it is on. But Tom has a point that if we centralize the logic that
PGSS has, then other extensions will have no way to change it, unless we add
some new configuration toggle, as Julien pointed out. Tom also raised the
concern that the current PGSS query-jumbling logic is not bulletproof and may
take time and thus impact performance.

Let's take one step back. Since queryId is stored in core, as Julien pointed
out, can we just add that global to pg_stat_get_activity and ultimately
expose it in the pg_stat_activity view? Then no matter whether PGSS is on or
off, or however customer extensions update that field, we expose it in that
view and enable users to leverage that ID to join with PGSS or their
extension. Does this sound like a good idea?

Thanks again,
Yun

On Sat, Mar 16, 2019 at 11:01 AM Julien Rouhaud  wrote:

> On Sat, Mar 16, 2019 at 5:20 PM Tom Lane  wrote:
> >
> > Robert Haas  writes:
> > > On Fri, Mar 15, 2019 at 9:50 PM Tom Lane  wrote:
> > >> pg_stat_statements has a notion of query ID, but that notion might be
> > >> quite inappropriate for other usages, which is why it's an extension
> > >> and not core.
> >
> > > Having written an extension that also wanted a query ID, I disagree
> > > with this position.
> >
> > [ shrug... ]  The fact remains that pg_stat_statements's definition is
> > pretty lame.  There's a lot of judgment calls in which query fields
> > it chooses to examine or ignore, and there's been no attempt at all
> > to make the ID PG-version-independent, and I rather doubt that it's
> > platform-independent either.  Nor will the IDs survive a dump/reload
> > even on the same server, since object OIDs will likely change.
> >
> > These things are OK, or at least mostly tolerable, for
> pg_stat_statements'
> > usage ... but I don't think it's a good idea to have the core code
> > dictating that definition to all extensions.  Right now, if you have
> > an extension that needs some other query-ID definition, you can do it,
> > you just can't run that extension alongside pg_stat_statements.
> > But you'll be out of luck if the core code starts filling that field.
> >
> > I'd be happier about having the core code compute a query ID if we
> > had a definition that was not so obviously slapped together.
>
> But the queryId itself is stored in core.  Exposing it in
> pg_stat_activity or log_line_prefix would still allow users to choose
> the implementation of their choice, or none.  That seems like a
> different complaint from asking pgss integration in core to have all
> its metrics available by default (or at least without a restart).
>
> Maybe we could add a GUC for pg_stat_statements to choose whether it
> should set the queryid itself and not, if anyone wants to have its
> metrics but with different queryid semantics?
>


Re: Fix XML handling with DOCTYPE

2019-03-18 Thread Chapman Flack
There might be too many different email threads on this with patches,
but in case it went under the radar, xml-content-2006-3.patch appeared
in my previous message on this thread[1].

It is based on a simple pre-check of the prefix of the input, determining
which form of parse to apply. That may or may not be simpler than parse-
once-save-error-parse-again-report-first-error, but IMV it's a more direct
solution and clearer (the logic is clearly about "how do I determine the way
this input should be parsed?" which is the problem on the table, rather
than "how should I save and regurgitate this libxml error?" which turns the
problem on the table to a different one).

I decided, for a first point of reference, to wear the green eyeshade and
write a pre-check that exactly implements the applicable rules. That gives
a starting point for simplifications that are probably safe.

For example, a bunch of lines at the end have to do with verifying the
content inside of a processing-instruction, after finding where it ends.
We could reasonably decide that, for the purpose of skipping it, knowing
where it ends is enough, as libxml will parse it next and report any errors
anyway.

That would slightly violate my intention of sending input to (the parser
that wasn't asked for) /only/ when it's completely clear (from the prefix
we've seen) that that's where it should go. The relaxed version could do
that in completely-clear cases and cases with an invalid PI ahead of what
looks like a DTD. But you'd pretty much expect both parsers to produce
the same message for a bad PI anyway.

That made me just want to try it now, and--surprise!--the messages from
libxml are not the same. So maybe I would lean to keeping the green-eyeshade
rules in the test, if you can stomach them, but I would understand taking
them out.

Regards,
-Chap

[1] https://www.postgresql.org/message-id/5c8ecaa4.3090...@anastigmatix.net



Re: BUG #15641: Autoprewarm worker fails to start on Windows with huge pages in use

2019-03-18 Thread Mithun Cy
Thanks Robert,
On Mon, Mar 18, 2019 at 9:01 PM Robert Haas  wrote:

> On Mon, Mar 18, 2019 at 3:04 AM Mithun Cy  wrote:
> > autoprewarm waorker should not be restarted. As per the code
> @apw_start_database_worker@ master starts a worker per database and wait
> until it exit by calling WaitForBackgroundWorkerShutdown.  The call
> WaitForBackgroundWorkerShutdown cannot handle the case if the worker was
> restarted. The WaitForBackgroundWorkerShutdown() get the status
> BGWH_STOPPED from the call GetBackgroundWorkerPid() if worker got
> restarted. So master will next detach the shared memory and next restarted
> worker keep failing going in a unending loop.
>
> Ugh, that seems like a silly oversight.  Does it fix the reported problem?
>

-- Yes, this fixes the reported issue. Hans Buschmann gave the steps below to
reproduce it.

> This seems easy to reproduce:
>
> - Install/create a database with autoprewarm on and pg_prewarm loaded.
> - Fill the autoprewarm cache with some data
> - pg_dump the database
> - drop the database
> - create the database and pg_restore it from the dump
> - start the instance and logs are flooded

-- As explained earlier [1], they used an older autoprewarm.blocks file
which was generated before the database was dropped. So on restart the
autoprewarm worker failed to connect to the dropped database and then went
into a retry loop. This patch should fix that.

NOTE: Another kind of error a user might see because of the same bug is the
restarted worker connecting to the next database in autoprewarm.blocks,
because the autoprewarm master has already updated the shared data
("apw_state->database = current_db;") to start a new worker for the next
database. Both the restarted worker and the newly created worker will then
connect to the same (next) database and try to load the same pages, ending up
with spurious log messages like "LOG:  autoprewarm successfully prewarmed 13
of 11 previously-loaded blocks".

If I understand correctly, the commit message would be something like this:
>
> ==
> Don't auto-restart per-database autoprewarm workers.
>
> We should try to prewarm each database only once.  Otherwise, if
> prewarming fails for some reason, it will just keep retrying in an
> infnite loop.  The existing code was intended to implement this
> behavior, but because it neglected to set worker.bgw_restart_time, the
> per-database workers keep restarting, contrary to what was intended.
>
> Mithun Cy, per a report from Hans Buschmann
> ==
>
> Does that sound right?
>

-- Yes, I agree.

[1]
https://www.postgresql.org/message-id/D2B9F2A20670C84685EF7D183F2949E202569F21%40gigant.nidsa.net

-- 
Thanks and Regards
Mithun Chicklore Yogendra
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Block level parallel vacuum

2019-03-18 Thread Robert Haas
On Thu, Mar 14, 2019 at 3:37 AM Masahiko Sawada  wrote:
> BTW your patch seems to not apply to the current HEAD cleanly and to
> need to update the comment of vacuum().

Yeah, I omitted some hunks by being stupid with 'git'.

Since you seem to like the approach, I put back the hunks I intended
to have there, pulled in one change from your v2 that looked good,
made one other tweak, and committed this.  I think I like what I did
with vacuum_open_relation a bit better than what you did; actually, I
think it cannot be right to just pass 'params' when the current code
is passing params->options & ~(VACOPT_VACUUM).  My approach avoids
that particular pitfall.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?

2019-03-18 Thread Nikolay Samokhvalov
Hello

On Sat, Mar 16, 2019 at 7:32 AM Robert Haas  wrote:

> Also, I think this is now the third independent request to expose
> query ID in pg_stat_statements.  I think we should give the people
> what they want.
>

Count me as the 4th.

This would be a very important feature for automated query analysis.
pg_stat_statements lacks query examples, and the only way to get them is
from the logs, where we don't have the queryid either. So people end up
either doing it manually or writing yet another set of nasty regular
expressions.

Routine query analysis is crucial for any large project. If there is a
chance to implement queryid for pg_stat_activity (or anything else that
allows automating query analysis) in Postgres 12 or later, this would be
great news and a huge help for engineers, on the same level as the recently
implemented sampling for statement logging.

By the way, if queryid goes into core someday, I'm sure it is worth
considering using it in logs as well.

Thanks,
Nik


Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?

2019-03-18 Thread Julien Rouhaud
On Mon, Mar 18, 2019 at 6:23 PM Yun Li  wrote:
>
> Let's take one step back. Since queryId is stored in core as Julien pointed 
> out, can we just add that global to the pg_stat_get_activity and ultimately 
> exposed in pg_stat_activity view? Then no matter whether PGSS  is on or off, 
> or however the customer extensions are updating that filed, we expose that 
> field in that view then enable user to leverage that id to join with pgss or 
> their extension. Will this sounds a good idea?

I'd greatly welcome exposing the queryid in pg_stat_activity, and
also in log_line_prefix.  I'm afraid that it's too late for pg12
inclusion, but I'll be happy to provide a patch for that for pg13.



Re: jsonpath

2019-03-18 Thread Tom Lane
Just another minor bitch about this patch: jsonpath_scan.l has introduced
a typedef called "keyword".  This is causing pgindent to produce seriously
ugly results in libpq, and probably in other places where that is used as
a field or variable name.  Please rename that typedef to something less
generic.

regards, tom lane



Re: Row Level Security − leakproof-ness and performance implications

2019-03-18 Thread Peter Eisentraut
On 2019-02-21 15:56, Pierre Ducroquet wrote:
> I understand these decisions, but it makes RLS quite fragile, with numerous
> undocumented side-effects. In order to spare future users these difficulties, I
> wrote this patch to the documentation, listing the biggest restrictions I have
> hit with RLS so far.

This appears to be the patch of record for this commit fest entry.

I agree that it would be useful to document and discuss better which
built-in operators are leak-proof and which are not.  But I don't think
the CREATE POLICY reference page is the place to do it.  Note that the
leak-proofness mechanism was originally introduced for security-barrier
views (an early form of RLS if you will), so someone could also
reasonably expect a discussion there.

I'm not sure of the best place to put it.  Perhaps adding a section to
the Functions and Operators chapter would work.

Also, once you start such a list, there will be an expectation that it's
complete.  So that would need to be ensured.  You only list a few things
you found.  Are there others?  How do we keep this up to date?

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Block level parallel vacuum

2019-03-18 Thread Robert Haas
On Thu, Mar 14, 2019 at 3:37 AM Masahiko Sawada  wrote:
> Attached are the updated patch you proposed and the patch that converts
> the grammar productions for the VACUUM option on top of the former
> patch. The latter patch moves VacuumOption to vacuum.h since the
> parser no longer needs that information.

Committed.

> If we take this direction I will change the parallel vacuum patch so
> that it adds new PARALLEL option and adds 'nworkers' to VacuumParams.

Sounds good.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: BUG #15641: Autoprewarm worker fails to start on Windows with huge pages in use

2019-03-18 Thread Robert Haas
On Mon, Mar 18, 2019 at 1:43 PM Mithun Cy  wrote:
>> Does that sound right?
>
> -- Yes I Agree.

Committed with a little more tweaking of the commit message, and
back-patched to v11.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



[PATCH] src/test/modules/dummy_index -- way to test reloptions from inside of access method

2019-03-18 Thread Nikolay Shaplov
Hi!

This patch introduces a dummy_index access method module that does not do any
indexing at all, but allows testing reloptions from inside an access method
extension.

This patch is part of my bigger work on reloptions refactoring.

It came from the
https://www.postgresql.org/message-id/20190220060832.GI15532%40paquier.xyz
thread, where I suggested adding an "enum" reloption type, and we came to the
conclusion that we need to test how this new option works for an access method
created from an extension (it does not work the same way as for in-core access
methods). But we can't add this option to the bloom index, so we need an index
extension that can be freely used for tests.

So I created src/test/modules/dummy_index. It does no real indexing, but it
has all the types of reloptions that can be set (reloption_int, reloption_real,
reloption_bool, reloption_string and reloption_string2). It also has a set of
boolean GUC variables that enable test output concerning a certain reloption
(do_test_reloption_int, do_test_reloption_real, do_test_reloption_bool,
do_test_reloption_string and do_test_reloption_string2); do_test_reloptions
must also be set to true to get any output at all.
The dummy index prints this output when an index is created and when a record
is inserted (the latter is needed to check that your ALTER TABLE worked).
Then you just use normal regression tests: turn on the test output, set some
reloption, and check in the test output that the value properly reaches the
access method internals.

While writing this module I kept in mind that it could also be used for other
am-related tests, so I separated the code into two parts: dummy_index.c
contains only the code implementing an empty access method, and all code
related to the reloptions tests went into direloptions.c. So in the future
somebody can add a di[what_ever_he_wants].c with their own test code, add the
necessary calls to dummy_index.c, create some GUC variables, and have their
own feature tested.

So I kindly ask you to review and commit this module, so that I can continue
my work on reloptions refactoring...

Thanks! 
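A rough sketch of the GUC-gated test output described above might look like
the following; the struct, function and GUC names here are illustrative
assumptions, not code taken from the attached patch:

#include "postgres.h"
#include "fmgr.h"
#include "utils/guc.h"
#include "utils/rel.h"

PG_MODULE_MAGIC;

/* GUCs that gate the test output (names assumed for this sketch) */
static bool do_test_reloptions = false;
static bool do_test_reloption_int = false;

/* parsed reloptions of the dummy index (layout assumed for this sketch) */
typedef struct DummyIndexOptions
{
    int32       vl_len_;        /* varlena header required by reloptions code */
    int         reloption_int;  /* the option under test */
} DummyIndexOptions;

void
_PG_init(void)
{
    DefineCustomBoolVariable("dummy_index.do_test_reloptions",
                             "Enable reloption test output.", NULL,
                             &do_test_reloptions, false,
                             PGC_USERSET, 0, NULL, NULL, NULL);
    DefineCustomBoolVariable("dummy_index.do_test_reloption_int",
                             "Report the int reloption value.", NULL,
                             &do_test_reloption_int, false,
                             PGC_USERSET, 0, NULL, NULL, NULL);
}

/* called from the AM's build and insert callbacks in this sketch */
static void
di_test_reloptions(Relation index)
{
    DummyIndexOptions *opts = (DummyIndexOptions *) index->rd_options;

    if (!do_test_reloptions || opts == NULL)
        return;
    if (do_test_reloption_int)
        elog(NOTICE, "reloption_int is set to %d", opts->reloption_int);
}

A regression test can then SET the GUCs, create or alter a dummy index with a
WITH (...) clause, and compare the NOTICE output against the expected file.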

diff --git a/.gitignore b/.gitignore
index 794e35b..37331c2 100644
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 19d60a5..aa34320 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -7,6 +7,7 @@ include $(top_builddir)/src/Makefile.global
 SUBDIRS = \
 		  brin \
 		  commit_ts \
+		  dummy_index \
 		  dummy_seclabel \
 		  snapshot_too_old \
 		  test_bloomfilter \
diff --git a/src/test/modules/dummy_index/.gitignore b/src/test/modules/dummy_index/.gitignore
new file mode 100644
index 000..5dcb3ff
--- /dev/null
+++ b/src/test/modules/dummy_index/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/dummy_index/Makefile b/src/test/modules/dummy_index/Makefile
new file mode 100644
index 000..95c1fcc
--- /dev/null
+++ b/src/test/modules/dummy_index/Makefile
@@ -0,0 +1,21 @@
+# contrib/bloom/Makefile
+
+MODULE_big = dummy_index
+OBJS = dummy_index.o direloptions.o $(WIN32RES)
+
+EXTENSION = dummy_index
+DATA = dummy_index--1.0.sql
+PGFILEDESC = "dummy index access method - needed for all kinds of tests"
+
+REGRESS = reloptions
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/dummy_index
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/dummy_index/README b/src/test/modules/dummy_index/README
new file mode 100644
index 000..b704112
--- /dev/null
+++ b/src/test/modules/dummy_index/README
@@ -0,0 +1,34 @@
+DUMMY INDEX
+
+Dummy index, an index module for testing am-related internal calls that can't be
+properly tested inside production indexes, but better to be tested inside actual
+access method module.
+
+Guidelines:
+1. Keep this index as simple as possible. It should not do any real indexing
+(unless you need it for your tests); if you do need it, make it as simple as
+possible;
+2. Keep all code related to feature you test inside id[name_of_a_feature].c
+file. dummy_index.c file should contain code that is needed for everyone, and
+make calls to feature-related code;
+3. Code related to your feature tests should be enabled and disabled by GUC
+variables. Create as many boolean GUC variables as you need and set them on so
+your test will know when it is time to do test output. Thus output from
+different features tests are not interfered;
+4. Add section to this README file that describes the general idea of what your
+tests do.
+
+RELOPTIONS TESTS
+
+Here we check that values of reloptions properly reaches am-internals, and that
+all reloption-related code works as expected. Several GUC variables are defined
+to control test's output. do_test_reloptions turns on and off any
+reloptions-related output. You will also get some test out

Re: Speed up transaction completion faster after many relations are accessed in a transaction

2019-03-18 Thread Peter Eisentraut
On 2019-02-20 07:20, Tsunakawa, Takayuki wrote:
> From: Tom Lane [mailto:t...@sss.pgh.pa.us]
>> Hm.  Putting a list header for a purely-local data structure into shared
>> memory seems quite ugly.  Isn't there a better place to keep that?
> 
> Agreed.  I put it in the global variable.

I think there is agreement on the principles of this patch.  Perhaps it
could be polished a bit.

Your changes in LOCALLOCK still refer to PGPROC, from your first version
of the patch.

I think the reordering of struct members could be done as a separate
preliminary patch.

Some more documentation in the comment before dlist_head LocalLocks to
explain this whole mechanism would be nice.

You posted a link to some performance numbers, but I didn't see the test
setup explained there.  I'd like to get some more information on the
impact of this.  Is there an effect with 100 tables, or do you need 10?

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
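The mechanism being reviewed above, a process-local list header (dlist_head
LocalLocks) that presumably lets transaction end visit only the locks this
backend actually holds, could be sketched roughly as follows. The struct and
function names are placeholders and this is not the patch's code:

#include "postgres.h"
#include "lib/ilist.h"

/* process-local list of locks held by this backend (sketch only) */
static dlist_head LocalLocks = DLIST_STATIC_INIT(LocalLocks);

typedef struct SketchLocalLock
{
    dlist_node  node;           /* links this entry into LocalLocks */
    /* ... lock tag, lock mode, counts, etc. ... */
} SketchLocalLock;

/* remember each lock as it is acquired */
static void
remember_local_lock(SketchLocalLock *locallock)
{
    dlist_push_tail(&LocalLocks, &locallock->node);
}

/* at transaction end, visit only the locks that were actually taken */
static void
release_all_local_locks(void)
{
    dlist_mutable_iter iter;

    dlist_foreach_modify(iter, &LocalLocks)
    {
        SketchLocalLock *locallock =
            dlist_container(SketchLocalLock, node, iter.cur);

        dlist_delete(&locallock->node);
        /* release the underlying lock here */
    }
}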



Re: jsonpath

2019-03-18 Thread Alexander Korotkov
On Mon, Mar 18, 2019 at 10:08 PM Tom Lane  wrote:
> Just another minor bitch about this patch: jsonpath_scan.l has introduced
> a typedef called "keyword".  This is causing pgindent to produce seriously
> ugly results in libpq, and probably in other places where that is used as
> a field or variable name.  Please rename that typedef to something less
> generic.

Ooops... I propose to rename it to KeyWord, which is already
typedef'ed in formatting.c.  See the attached patch.  Is it OK?

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


jsonpath_rename_typedef.patch
Description: Binary data


Re: Row Level Security − leakproof-ness and performance implications

2019-03-18 Thread Peter Eisentraut
On 2019-02-28 00:03, Joe Conway wrote:
> What if we provided an option to redact all client messages (leaving
> logged messages as-is). Separately we could provide a GUC to force all
> functions to be resolved as leakproof. Depending on your requirements,
> having both options turned on could be perfectly acceptable.

There are two commit fest entries for this thread, one in Pierre's name
and one in yours.  Is your entry for the error message redacting
functionality?  I think that approach has been found not to actually
satisfy the leakproofness criteria.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [PATCH][PROPOSAL] Add enum releation option type

2019-03-18 Thread Nikolay Shaplov
In a message of Wednesday, 20 February 2019 15:08:32 MSK, Michael Paquier wrote:

> > 2.5  May be this src/test/modules dummy index is subject to another patch.
> > So I will start working on it right now, but we will do this work not
> > dependent to any other patches. And just add there what we need when it
> > is ready. I would actually add there some code that checks all types of
> > options, because we actually never check that option value from reloptons
> > successfully reaches extension code. I did this checks manually, when I
> > wrote that bigger reloption patch, but there is no facilities to do
> > regularly check is via test suit. Creating dummy index will create such
> > facilities.
> 
> bloom is a user-facing extension, while src/test/modules are normally
> not installed (there is an exception for MSVC anyway..), so stressing
> add_enum_reloption in src/test/modules looks more appropriate to me,
> except if you can define an option which is useful for the user and is
> an enum.
I've created a module in src/test/modules where we will be able to add enum
support once that module is committed.

https://commitfest.postgresql.org/23/2064/

I've filed it as a separate patch (it feels like standalone work to me).
Sadly I did not manage to prepare it before this commitfest, so I've added it
to the next one. :-((





Re: Online verification of checksums

2019-03-18 Thread Robert Haas
On Mon, Mar 18, 2019 at 2:06 AM Michael Paquier  wrote:
> The mentions on this thread that the server has all the facility in
> place to properly lock a buffer and make sure that a partial read
> *never* happens and that we *never* have any kind of false positives,
> directly preventing the set of issues we are trying to implement
> workarounds for in a frontend tool are rather good arguments in my
> opinion (you can grep for BufferDescriptorGetIOLock() on this thread
> for example).

Yeah, exactly.  It may be that there is a good way to avoid those
issues without interacting with the server and that would be nice, but
... as far as I can see, nobody's figured out a way that's reliable
yet, and all of the solutions proposed so far basically amount to
"let's ignore things that might be serious problems because they might
be transient" and/or "let's retry and see if the problem goes away."
I'm more sanguine about a retry-based solution than an
ignore-possible-problems solution, but what's been proposed so far
seems quite prone to retrying so fast that it makes no difference, and
it's not clear how much code complexity we'd have to add to do better
or how reliable it would be even then.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Rare SSL failures on eelpout

2019-03-18 Thread Tom Lane
I wrote:
> ... I don't like pqHandleSendFailure all that much: it has strong
> constraints on what state libpq has to be in, as a consequence of which
> it's called from a bunch of ad-hoc places, and can't be called from
> some others.  It's kind of accidental that it will work here.
> I was toying with the idea of getting rid of that function in
> favor of a design like this:
> * Handle any socket-write failure at some low level of libpq by
> recording the fact that the error occurred (plus whatever data we
> need to produce a message about it), and then starting to just
> discard output data.
> * Eventually, we'll try to read.  Process any available input data
> as usual (and, if we get a read error, report that).  When no more
> input data is available, if a socket write failure has been recorded,
> report that much as if it were an incoming ERROR message, and then
> shut down the connection.
> This would essentially give us pqHandleSendFailure-like behavior
> across the board, which might be enough to fix this problem as well as
> bug #15598.  Or not ... as you say, we haven't thoroughly understood
> either issue, so it's possible this wouldn't do it.

Here's a draft patch along that line.  It's gratifyingly small, and
it does fix the SSL problem for me.  I have no way of investigating
whether it fixes bug #15598, but it might.

Aside from the SSL cert-verify issue, I've checked the behavior when
the backend is shut down ("pg_ctl stop") between queries, and when
the backend is kill 9'd mid-query.  The shutdown case reacts well with
or without SSL:

regression=# select 2+2;
FATAL:  terminating connection due to administrator command
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.

The backend-crash case is fine without SSL:

regression=# select * from tenk1 t1, tenk1 t2, tenk1 t3;
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Succeeded.

but a bit less nice with it:

regression=# select * from tenk1 t1, tenk1 t2, tenk1 t3;
SSL SYSCALL error: EOF detected
The connection to the server was lost. Attempting reset: Succeeded.

That seems cosmetic --- we just haven't expended the same effort on
what to print for connection EOF with SSL as without it.  (Also,
you get that same result without this patch.)

Of course, this hardly counts as exhaustive testing, especially since
I only tried one OpenSSL version.

> One thing that isn't real clear to me is how much timing sensitivity
> there is in "when no more input data is available".  Can we assume that
> if we've gotten ECONNRESET or an allied error from a write, then any
> data the far end might've wished to send us is already available to
> read?  In principle, since TCP allows shutting down either transmission
> direction independently, the server could close the read side of its
> socket while continuing to transmit data.  In practice, PG servers
> don't do that; but could network timing glitches create the same effect?
> Even if it's possible, does it happen enough to worry about?
> The reason I'm concerned is that I don't think it'd be bright to ignore a
> send error until we see input EOF, which'd be the obvious way to solve a
> timing problem if there is one.  If our send error was some transient
> glitch and the connection is really still open, then we might not get EOF,
> but we won't get a server reply either because no message went to the
> server.  You could imagine waiting some small interval of time before
> deciding that it's time to report the write failure, but ugh.

As written, this patch does seem vulnerable to this concern, if it's real.
We could address it for post-connection failures by tweaking PQgetResult
so that it doesn't call pqWait if there's already a write error recorded;
but I'm far from sure that that would be a net improvement.  Also, making
a comparable change to the behavior of the PQconnectPoll state machine
might be tricky.

My current feeling is that this is OK to put in HEAD but I think the
risk-reward ratio isn't very good for the back branches.  Even with
an OpenSSL version where this makes a difference, the problematic
behavior is pretty hard to hit.  So I'm a bit inclined to do nothing
in the back branches.

regards, tom lane
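In outline, the approach described above (record the first write failure, keep
discarding output, and surface the saved error once no more input is
available) can be sketched like this; the names are hypothetical and this is
not the attached patch, which follows below:

#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

typedef struct ConnSketch
{
    bool    write_failed;       /* has a send on this connection failed? */
    char   *write_err_msg;      /* saved message describing that failure */
} ConnSketch;

/* called from the low-level send path when a write fails */
static void
note_write_failure(ConnSketch *conn, const char *msg)
{
    if (!conn->write_failed)
    {
        conn->write_failed = true;
        conn->write_err_msg = strdup(msg);
    }
    /* the caller then discards the output data instead of retrying */
}

/* called once no more input is available while waiting for a reply */
static const char *
deferred_write_failure(ConnSketch *conn)
{
    /* non-NULL means: report this as if it were an incoming ERROR message,
     * then shut the connection down */
    return conn->write_failed ? conn->write_err_msg : NULL;
}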

diff --git a/src/interfaces/libpq/fe-connect.c b/src/interfaces/libpq/fe-connect.c
index c96a52b..e3bf6a7 100644
--- a/src/interfaces/libpq/fe-connect.c
+++ b/src/interfaces/libpq/fe-connect.c
@@ -537,6 +537,10 @@ pqDropServerData(PGconn *conn)
 	conn->last_sqlstate[0] = '\0';
 	conn->auth_req_received = false;
 	conn->password_needed = false;
+	conn->write_failed = false;
+	if (conn->write_err_msg)
+		free(conn->write_err_msg);
+	conn->write_err_ms

Re: Row Level Security − leakproof-ness and performance implications

2019-03-18 Thread Joe Conway
On 3/18/19 3:52 PM, Peter Eisentraut wrote:
> On 2019-02-28 00:03, Joe Conway wrote:
>> What if we provided an option to redact all client messages (leaving
>> logged messages as-is). Separately we could provide a GUC to force all
>> functions to be resolved as leakproof. Depending on your requirements,
>> having both options turned on could be perfectly acceptable.
> 
> There are two commit fest entries for this thread, one in Pierre's name
> and one in yours.  Is your entry for the error message redacting
> functionality?  I think that approach has been found not to actually
> satisfy the leakproofness criteria.


It is a matter of opinion with regard to what the criteria actually is,
and when it ought to apply. But in any case the clear consensus was
against me, so I guess I'll assume "my patch was rejected by PostgreSQL
all I got was this tee shirt" (...I know I have one that says something
like that somewhere...) ;-)

I have no idea what the other entry is all about as I have not had the
time to look.

Joe

-- 
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development



signature.asc
Description: OpenPGP digital signature


Re: Online verification of checksums

2019-03-18 Thread Michael Banck
Hi,

Am Montag, den 18.03.2019, 16:11 +0800 schrieb Stephen Frost:
> On Mon, Mar 18, 2019 at 15:52 Michael Banck  wrote:
> > Am Montag, den 18.03.2019, 03:34 -0400 schrieb Stephen Frost:
> > > Thanks for that.  Reading through the code though, I don't entirely
> > > understand why we're making things complicated for ourselves by trying
> > > to seek and re-read the entire block, specifically this:
> > 
> > [...]
> > 
> > > I would think that we could just do:
> > > 
> > >   insert_location = 0;
> > >   r = read(BLCKSIZE - insert_location);
> > >   if (r < 0) error();
> > >   if (r == 0) EOF detected, move to next
> > >   if (r < (BLCKSIZE - insert_location)) {
> > >     insert_location += r;
> > >     continue;
> > >   }
> > > 
> > > At this point, we should have a full block, do our checks...
> > 
> > Well, we need to read() into some buffer which you have omitted.
> 
> Surely there’s a buffer the read in the existing code is passing in,
> you just need to offset by the current pointer, sorry for not being
> clear.
> 
> In other words the read would look more like:
> 
> read(fd,buf + insert_ptr, BUFSZ - insert_ptr)
> 
> And then you have to reset insert_ptr once you have a full block.

Ok, thanks for clearing that up.

I've tried to do that now in the attached, does that suit you?
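For reference, the loop being described in the quoted exchange could look
roughly like this (illustrative only, not the attached patch; BLCKSZ here
simply mirrors the PostgreSQL block size):

#include <errno.h>
#include <unistd.h>

#define BLCKSZ 8192

/* Returns 1 when a full block was read, 0 on EOF / short tail, -1 on error. */
static int
read_full_block(int fd, char *buf)
{
    int     insert_loc = 0;

    while (insert_loc < BLCKSZ)
    {
        ssize_t r = read(fd, buf + insert_loc, BLCKSZ - insert_loc);

        if (r < 0)
        {
            if (errno == EINTR)
                continue;       /* interrupted, just retry */
            return -1;          /* real read error */
        }
        if (r == 0)
            break;              /* EOF before the block was complete */
        insert_loc += r;        /* partial read: keep filling the block */
    }
    return (insert_loc == BLCKSZ) ? 1 : 0;
}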

> Yes, the question was more like this: have you ever seen a read return
> a partial result when you know you’re in the middle somewhere of an
> existing file and the length of the file hasn’t been changed by
> something else..?

I don't think I've seen that, but I guess it wouldn't turn up in regular
testing anyway, only in pathological cases.  I think we are probably dealing
with this in the current version of the patch, but I can't say for certain,
as it sounds pretty difficult to test.

I have also added a paragraph to the documentation about possibly
skipping new or recently updated pages:

+   If the cluster is online, pages that have been (re-)written since the last
+   checkpoint will not count as checksum failures if they cannot be read or
+   verified correctly.

Wording improvements welcome.


Michael


-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.ba...@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz

diff --git a/doc/src/sgml/ref/pg_checksums.sgml b/doc/src/sgml/ref/pg_checksums.sgml
index 6a47dda683..c1fa167508 100644
--- a/doc/src/sgml/ref/pg_checksums.sgml
+++ b/doc/src/sgml/ref/pg_checksums.sgml
@@ -37,10 +37,15 @@ PostgreSQL documentation
   Description
   
pg_checksums verifies data checksums in a
-   PostgreSQL cluster.  The server must be shut
-   down cleanly before running pg_checksums.
-   The exit status is zero if there are no checksum errors, otherwise nonzero.
+   PostgreSQL cluster.  The exit status is zero if
+   there are no checksum errors, otherwise nonzero.  
   
+
+  
+   If the cluster is online, pages that have been (re-)written since the last
+   checkpoint will not count as checksum failures if they cannot be read or
+   verified correctly.
+  
  
 
  
diff --git a/src/bin/pg_checksums/pg_checksums.c b/src/bin/pg_checksums/pg_checksums.c
index b7ebc11017..a82ce7d7b4 100644
--- a/src/bin/pg_checksums/pg_checksums.c
+++ b/src/bin/pg_checksums/pg_checksums.c
@@ -1,7 +1,7 @@
 /*-
  *
  * pg_checksums.c
- *	  Verifies page level checksums in an offline cluster.
+ *	  Verifies page level checksums in an cluster.
  *
  * Copyright (c) 2010-2019, PostgreSQL Global Development Group
  *
@@ -28,12 +28,16 @@
 
 
 static int64 files = 0;
+static int64 skippedfiles = 0;
 static int64 blocks = 0;
 static int64 badblocks = 0;
+static int64 skippedblocks = 0;
 static ControlFileData *ControlFile;
+static XLogRecPtr checkpointLSN;
 
 static char *only_relfilenode = NULL;
 static bool verbose = false;
+static bool online = false;
 
 static const char *progname;
 
@@ -90,10 +94,18 @@ scan_file(const char *fn, BlockNumber segmentno)
 	PageHeader	header = (PageHeader) buf.data;
 	int			f;
 	BlockNumber blockno;
+	int			block_loc = 0;
+	bool		block_retry = false;
 
 	f = open(fn, O_RDONLY | PG_BINARY, 0);
 	if (f < 0)
 	{
+		if (online && errno == ENOENT)
+		{
+			/* File was removed in the meantime */
+			return;
+		}
+
 		fprintf(stderr, _("%s: could not open file \"%s\": %s\n"),
 progname, fn, strerror(errno));
 		exit(1);
@@ -104,30 +116,112 @@ scan_file(const char *fn, BlockNumber segmentno)
 	for (blockno = 0;; blockno++)
 	{
 		uint16		csum;
-		int			r = read(f, buf.data, BLCKSZ);
+		int			r = read(f, buf.data + block_loc, BLCKSZ - block_loc);
 
 		if (r == 0)
+		{
+			/*
+			 * We had a short read and got an EOF

Re: outdated reference to tuple header OIDs

2019-03-18 Thread Andres Freund
Hi,

On 2019-03-14 15:49:40 +0800, John Naylor wrote:
> It seems this is a leftover from commit 578b229718e8. Patch attached.

Thanks for noticing. Pushed.

Greetings,

Andres Freund



Re: chained transactions

2019-03-18 Thread Fabien COELHO



Hallo Peter,


> Updated patch.  I have squashed the two previously separate patches
> together in this one.

Ok.

>> I do not understand the value of the SAVEPOINT in the tests.

> The purpose of the SAVEPOINT in the test is that it exercises
> different switch cases in CommitTransactionCommand() and
> AbortCurrentTransaction().  It's not entirely comprehensible from the
> outside, but code coverage analysis confirms it.

Ok.

>> Otherwise I'm okay with this patch.
>>
>> About the second patch, I'm still unhappy with functions named commit &
>> rollback doing something else, which results in somewhat strange code, where
>> you have to guess that the transaction is restarted in all cases, either
>> within the commit function or explicitly.

> I have updated the SPI interface with your suggestions.  I agree it's
> better that way.


Patch applies cleanly, compiles, make check ok, doc build ok.

Minor remarks:

In "xact.c", maybe I'd assign blockState in the else branch, instead of 
overriding it?


About the static _SPI_{commit,rollback} functions: I'm fine with these
functions, but I'm not sure about their names. Maybe
_SPI_chainable_{commit,rollback} would be clearer about their content?


Doc looks clear to me. ISTM "chain" should be added as an index term?

Tests look ok. Maybe I'd have series with mixed commit & rollback, instead 
of only commit & only rollback?


--
Fabien.



Re: jsonpath

2019-03-18 Thread Tom Lane
Alexander Korotkov  writes:
> On Mon, Mar 18, 2019 at 10:08 PM Tom Lane  wrote:
>> Just another minor bitch about this patch: jsonpath_scan.l has introduced
>> a typedef called "keyword".  This is causing pgindent to produce seriously
>> ugly results in libpq, and probably in other places where that is used as
>> a field or variable name.  Please rename that typedef to something less
>> generic.

> Ooops... I propose to rename it to KeyWord, which is already
> typedef'ed in formatting.c.  See the attached patch.  Is it OK?

I had in mind JsonPathKeyword or something like that.  If you re-use
formatting.c's typedef name, pgindent won't care, but it's possible
you'd be in for unhappiness when trying to look at these structs in
gdb for instance.

regards, tom lane



Re: jsonpath

2019-03-18 Thread Pavel Stehule
po 18. 3. 2019 v 21:23 odesílatel Tom Lane  napsal:

> Alexander Korotkov  writes:
> > On Mon, Mar 18, 2019 at 10:08 PM Tom Lane  wrote:
> >> Just another minor bitch about this patch: jsonpath_scan.l has
> introduced
> >> a typedef called "keyword".  This is causing pgindent to produce
> seriously
> >> ugly results in libpq, and probably in other places where that is used
> as
> >> a field or variable name.  Please rename that typedef to something less
> >> generic.
>
> > Ooops... I propose to rename it to KeyWord, which is already
> > typedef'ed in formatting.c.  See the attached patch.  Is it OK?
>
> I had in mind JsonPathKeyword or something like that.  If you re-use
> formatting.c's typedef name, pgindent won't care, but it's possible
> you'd be in for unhappiness when trying to look at these structs in
> gdb for instance.
>

JsonPathKeyword is better, being more verbose.

Pavel


> regards, tom lane
>


Re: [PATCH] remove repetitive characters in fdwhandler.sgml

2019-03-18 Thread Andres Freund
On 2019-03-12 23:19:23 -0700, Andres Freund wrote:
> On 2019-03-13 14:55:59 +0900, Etsuro Fujita wrote:
> > (2019/03/13 14:02), Michael Paquier wrote:
> > > On Tue, Mar 12, 2019 at 01:37:04AM +, Zhang, Jie wrote:
> > > > Here is a tiny patch removing repetitive characters [if] in 
> > > > fdwhandler.sgml.
> > > 
> > >   
> > > - This function should store the tuple into the provided, or clear it 
> > > if if
> > > + This function should store the tuple into the provided, or clear it 
> > > if
> > >the row lock couldn't be obtained.  The row lock type to acquire is
> > > 
> > > The typo is clear, however the formulation of the full sentence is
> > > confusing.  This function should store the tuple into the provided
> > > slot, no?
> > 
> > Yeah, I think so too.
> 
> Sorry for that, I'll fix the sentence tomorrow. Andres vs Grammar: 3 :
> 3305.

And pushed.  Thanks for the report!



Re: partitioned tables referenced by FKs

2019-03-18 Thread Alvaro Herrera
Hi Jesper

On 2019-Mar-01, Jesper Pedersen wrote:

> I'm getting a failure in the pg_upgrade test:
> 
>  --
> +-- Name: pk5 pk5_pkey; Type: CONSTRAINT; Schema: regress_fk; Owner:
> jpedersen
> +--
> +
> +ALTER TABLE ONLY regress_fk.pk5
> +ADD CONSTRAINT pk5_pkey PRIMARY KEY (a);
> +
> +
> +--
> 
> when running check-world.

So I tested this case just now (after fixing a couple of bugs) and see a
slightly different problem now: those excess lines are actually just
that pg_dump chose to print the constraints in different order, as there
are identical lines with "-" in the diff I get.  The databases actually
*are* identical as far as I can tell.  I haven't yet figured out how to
fix this; of course, the easiest solution is to just drop the regress_fk
schema at the end of the foreign_key.sql regression test, but I would
prefer a different solution.

I'm going to post the new version of the patch separately.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
--- /pgsql/build/part5/src/bin/pg_upgrade/tmp_check/dump1.sql   2019-03-18 
17:28:00.201071102 -0300
+++ /pgsql/build/part5/src/bin/pg_upgrade/tmp_check/dump2.sql   2019-03-18 
17:28:16.915627217 -0300
@@ -367111,6 +367111,14 @@
 
 
 --
+-- Name: pk5 pk5_pkey; Type: CONSTRAINT; Schema: regress_fk; Owner: alvherre
+--
+
+ALTER TABLE ONLY regress_fk.pk5
+ADD CONSTRAINT pk5_pkey PRIMARY KEY (a);
+
+
+--
 -- Name: pk51 pk51_pkey; Type: CONSTRAINT; Schema: regress_fk; Owner: alvherre
 --
 
@@ -367127,14 +367135,6 @@
 
 
 --
--- Name: pk5 pk5_pkey; Type: CONSTRAINT; Schema: regress_fk; Owner: alvherre
---
-
-ALTER TABLE ONLY regress_fk.pk5
-ADD CONSTRAINT pk5_pkey PRIMARY KEY (a);
-
-
---
 -- Name: at_partitioned_a_idx; Type: INDEX; Schema: public; Owner: alvherre
 --
 


Re: Progress reporting for pg_verify_checksums

2019-03-18 Thread Michael Banck
Hi,

thanks for the additional review!

Am Donnerstag, den 14.03.2019, 11:54 +0900 schrieb Kyotaro HORIGUCHI:
> At Wed, 13 Mar 2019 16:25:15 +0900, Michael Paquier  
> wrote in <20190313072515.gb2...@paquier.xyz>
> > On Wed, Mar 13, 2019 at 07:22:28AM +0100, Fabien COELHO wrote:
> > > Does not apply because of the renaming committed by Michaël.
> > > 
> > > Could you rebase?
> > 
> > This stuff touches pg_checksums.c, so you may want to wait one day or
> > two to avoid extra work...  I think that I'll be able to finish the
> > addition of the --enable and --disable switches by then.

I have rebased it now.

> > + /* Make sure we report at most once every tenth of a second */
> > + if ((INSTR_TIME_GET_MILLISEC(now) -
> > +  INSTR_TIME_GET_MILLISEC(last_progress_update) < 100) && 
> > !force)
> 
> I'm not a skilled gamer, and a 10Hz update is a bit hard on my
> eyes :p  The second hand of old clocks ticks at 4Hz. I think that is
> frequent enough.

Ok.

> > 841/1516 MB (55%, 700 MB/s)
> 
> Unlike in other progress reporting projects, estimating the ETC
> (estimated time to completion), which is the most important part,
> is quite easy. (pgbench does that.) And I'd like to see a
> progress bar there, but that might be overdoing it. I'm not sure
> that letting the current size occupy part of the screen width is
> needed.  I don't think the total size needs to be fixed in MB, and I
> think at most four or five digits are enough. (The same goes for the
> speed.)
> 
> If all of the above is included, the line would look as
> follows.
> 
> [=== ] ( 63% of 12.53 GB, 179 MB/s, ETC 26s)
> 
> # Note that this is just an opinion.
> 
> (pg_checksum runs fast at the beginning, so the ETC behaves somewhat
> strangely for a while.)

I haven't changed that for now as it seems to be a bit more involved.
I'd like to hear other opinions on whether that is worthwhile?

> > +#define MEGABYTES 1048576
> 
>  1024 * 1024 is preferable than that many digits.

Good point, changed that way.

> > + /* we handle SIGUSR1 only, and toggle the value of show_progress */
> > + if (signum == SIGUSR1)
> > + show_progress = !show_progress;
> 
> SIGUSR1 *toggles* progress. 

Not sure what you mean here; are you saying it should be called
toggle_progress rather than show_progress? I think it was like that
originally but got changed due to review comments.

> A garbage line is left alone after turning it off. It would be
> erasable. I'm not sure which is better, though.

How about just printing a "\n" in the signal handler if we are in a
terminal?  I think erasing the line might be too clever and a portability
hazard. I have done that now in the attached patch; let me know what you
think.
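In outline, the throttled reporting and the SIGUSR1 toggle discussed in this
thread could look like the following sketch. The names, the four updates per
second limit and the newline written from the handler are assumptions for
illustration; this is not the attached patch:

#include <signal.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

static volatile sig_atomic_t show_progress = 1;
static double last_report = 0.0;

static double
now_seconds(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

static void
toggle_progress_handler(int signum)
{
    (void) signum;
    show_progress = !show_progress;
    /* end the current progress line so later output starts cleanly */
    if (!show_progress)
        (void) write(STDERR_FILENO, "\n", 1);
}

static void
setup_progress_signal(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = toggle_progress_handler;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGUSR1, &sa, NULL);
}

static void
report_progress(long current_mb, long total_mb, bool force)
{
    double  now = now_seconds();

    if (!show_progress || total_mb <= 0)
        return;
    if (!force && now - last_report < 0.25)   /* at most four updates/second */
        return;
    last_report = now;
    fprintf(stderr, "%ld/%ld MB (%d%%)\r",
            current_mb, total_mb, (int) (current_mb * 100 / total_mb));
}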


Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.ba...@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz

diff --git a/doc/src/sgml/ref/pg_checksums.sgml b/doc/src/sgml/ref/pg_checksums.sgml
index 6a47dda683..c790170953 100644
--- a/doc/src/sgml/ref/pg_checksums.sgml
+++ b/doc/src/sgml/ref/pg_checksums.sgml
@@ -80,6 +80,19 @@ PostgreSQL documentation
  
 
  
+  -P
+  --progress
+  
+   
+Enable progress reporting. Turning this on will deliver an approximate
+progress report during the checksum verification. Sending the
+SIGUSR1 signal will toggle progress reporting
+on or off during the verification run (not available on Windows).
+   
+  
+ 
+
+ 
-V
--version

diff --git a/src/bin/pg_checksums/pg_checksums.c b/src/bin/pg_checksums/pg_checksums.c
index b7ebc11017..a771f67c18 100644
--- a/src/bin/pg_checksums/pg_checksums.c
+++ b/src/bin/pg_checksums/pg_checksums.c
@@ -14,6 +14,8 @@
 #include "postgres_fe.h"
 
 #include 
+#include 
+#include 
 #include 
 #include 
 
@@ -21,6 +23,7 @@
 #include "common/controldata_utils.h"
 #include "getopt_long.h"
 #include "pg_getopt.h"
+#include "portability/instr_time.h"
 #include "storage/bufpage.h"
 #include "storage/checksum.h"
 #include "storage/checksum_impl.h"
@@ -34,9 +37,18 @@ static ControlFileData *ControlFile;
 
 static char *only_relfilenode = NULL;
 static bool verbose = false;
+static bool show_progress = false;
 
 static const char *progname;
 
+/*
+ * Progress status information.
+ */
+int64		total_size = 0;
+int64		current_size = 0;
+instr_time	last_progress_update;
+instr_time	scan_started;
+
 static void
 usage(void)
 {
@@ -47,6 +59,7 @@ usage(void)
 	printf(_(" [-D, --pgdata=]DATADIR  data directory\n"));
 	printf(_("  -v, --verbose  output verbose messages\n"));
 	printf(_("  -r RELFILENODE check only relation with specified re

Re: Online verification of checksums

2019-03-18 Thread Stephen Frost
Greetings,

On Tue, Mar 19, 2019 at 04:15 Michael Banck 
wrote:

> Am Montag, den 18.03.2019, 16:11 +0800 schrieb Stephen Frost:
> > On Mon, Mar 18, 2019 at 15:52 Michael Banck 
> wrote:
> > > Am Montag, den 18.03.2019, 03:34 -0400 schrieb Stephen Frost:
> > > > Thanks for that.  Reading through the code though, I don't entirely
> > > > understand why we're making things complicated for ourselves by
> trying
> > > > to seek and re-read the entire block, specifically this:
> > >
> > > [...]
> > >
> > > > I would think that we could just do:
> > > >
> > > >   insert_location = 0;
> > > >   r = read(BLCKSIZE - insert_location);
> > > >   if (r < 0) error();
> > > >   if (r == 0) EOF detected, move to next
> > > >   if (r < (BLCKSIZE - insert_location)) {
> > > > insert_location += r;
> > > > continue;
> > > >   }
> > > >
> > > > At this point, we should have a full block, do our checks...
> > >
> > > Well, we need to read() into some buffer which you have omitted.
> >
> > Surely there’s a buffer the read in the existing code is passing in,
> > you just need to offset by the current pointer, sorry for not being
> > clear.
> >
> > In other words the read would look more like:
> >
> > read(fd,buf + insert_ptr, BUFSZ - insert_ptr)
> >
> > And then you have to reset insert_ptr once you have a full block.
>
> Ok, thanks for clearing that up.
>
> I've tried to do that now in the attached, does that suit you?


Yes, that’s what I was thinking.  I’m honestly not entirely convinced that
the lseek() efforts still need to be put in; I would have thought it’d be
fine to simply check the LSN on a checksum failure and mark the page as
skipped if the LSN is past the current checkpoint.  That seems like it would
make things much simpler, but I’m also not against keeping that logic now
that it’s in, provided it doesn’t cause issues.
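That LSN check could be sketched like this (illustrative only; how it would
be wired into the scan loop, and the counters to update, are not shown):

#include "postgres.h"
#include "access/xlogdefs.h"
#include "storage/bufpage.h"
#include "storage/checksum.h"

/*
 * Sketch: on a checksum mismatch, treat the page as concurrently rewritten
 * (and therefore skippable) if its LSN is newer than the checkpoint LSN the
 * scan started from.  blkno is the block number within the whole relation,
 * i.e. it must include the segment offset.
 */
static bool
page_verifies_or_is_skippable(Page page, BlockNumber blkno,
                              XLogRecPtr start_ckpt_lsn)
{
    uint16      expected = pg_checksum_page((char *) page, blkno);

    if (((PageHeader) page)->pd_checksum == expected)
        return true;            /* checksum verifies */
    if (PageGetLSN(page) > start_ckpt_lsn)
        return true;            /* written after our checkpoint: skip it */
    return false;               /* genuine checksum failure */
}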

> Yes, the question was more like this: have you ever seen a read return
> > a partial result when you know you’re in the middle somewhere of an
> > existing file and the length of the file hasn’t been changed by
> > something else..?
>
> I don't think I've seen that, but that wouldn't turn up in regular
> testing anyway I guess but only in pathological cases?  I guess we are
> probably dealing with this in the current version of the patch, but I
> can't say for certain as it sounds pretty difficult to test.


Yeah, a lot of things in this area are unfortunately difficult to test.
I’m glad to hear that it doesn’t sound like you’ve seen it though.

I have also added a paragraph to the documentation about possilby
> skipping new or recently updated pages:
>
> +   If the cluster is online, pages that have been (re-)written since the
> last
> +   checkpoint will not count as checksum failures if they cannot be read
> or
> +   verified correctly.


I would flip this around:

——-
In an online cluster, pages are being concurrently written to the files
while the check is being run, leading to possible torn pages or partial
reads.  When the tool detects a concurrently written page, indicated by the
page’s LSN being beyond the checkpoint the tool started at, that page will
be reported as skipped.  Note that in a crash scenario, any pages written
since the last checkpoint will be replayed from the WAL.
——-

Now here’s the $64 question- have you tested this latest version under
load..?  If not, could you?  And when you do, can you report back what the
results are?  Do you still see any actual checksum failures?  Do the number
of skipped pages seem reasonable in your tests or is there a concern there?

If you still see actual checksum failures which aren’t because the LSN is
higher than the checkpoint, or because of a short read, then we need to
investigate further but hopefully that isn’t happening now.  I think a lot
of the concerns raised on this thread about wanting to avoid false
positives are because the torn page (with higher LSN than current
checkpoint) and short read cases were previously reported as failures when
they really are expected.  Let’s test this as much as we can and make sure
we aren’t seeing false positives anymore.

Thanks!

Stephen


Re: partitioned tables referenced by FKs

2019-03-18 Thread Alvaro Herrera
On 2019-Feb-28, Amit Langote wrote:

> I'd like to hear your thoughts on some suggestions to alter the structure
> of the reorganized code around foreign key addition/cloning.  With this
> patch adding support for foreign keys to reference partitioned tables, the
> code now has to consider various cases due to the possibility of having
> partitioned tables on both sides of a foreign key, which is reflected in
> the complexity of the new code.

I spent a few hours studying this and my conclusion is the opposite of
yours: we should make addFkRecurseReferencing the recursive one, and
CloneFkReferencing a non-recursive caller of that.  So we end up with
both addFkRecurseReferenced and addFkRecurseReferencing as recursive
routines, and CloneFkReferenced and CloneFkReferencing being
non-recursive callers of those.  With this structure change there is one
more call to CreateConstraintEntry than before, and now there are two
calls of tryAttachPartitionForeignKey instead of one; I think with this
new structure things are much simpler.  I also changed
CloneForeignKeyConstraints's API: instead of returning a list of cloned
constraint giving its caller the responsibility of adding FK checks to
phase 3, we now give CloneForeignKeyConstraints the 'wqueue' list, so
that it can add the FK checks itself.  It seems much cleaner this way.

> Also, it seems a bit confusing that there is a CreateConstraintEntry call
> in addFkRecurseReferenced() which is creating a constraint on the
> *referencing* relation, which I think should be in
> ATAddForeignKeyConstraint, the caller.  I know we need to create a copy of
> the constraint to reference the partitions of the referenced table, but we
> might as well put it in CloneFkReferenced and reverse who calls who --
> make addFkRecurseReferenced() call CloneFkReferenced and have the code to
> create the cloned constraint and action triggers in the latter.  That will
> make the code to handle different sides of foreign key look similar, and
> imho, easier to follow.

Well, if you think about it, *all* the constraints created by all these
routines are in the referencing relations.  The only question here is
*why* we create those tuples; in the case of the ...Referenced routines,
it's because of the partitions in the referenced side; in the case of
the ...Referencing routines, it's because of partitions in the
referencing side.  I think changing it the way you suggest would be even
more confusing.


As discussed in the other subthread, I'm not making any effort to reuse
an existing constraint defined in a partition of the referenced side; as
far as I can tell that's a nonsensical transformation.


A pretty silly bug remains here.  Watch:

create table pk (a int primary key) partition by list (a);
create table pk1 partition of pk for values in (1);
create table fk (a int references pk);
insert into pk values (1);
insert into fk values (1);
drop table pk cascade;

Note that the DROP of the main PK table is pretty aggressive: since it
cascades, you want it to drop the constraint on the FK side.  This would
work without a problem if 'pk' was a non-partitioned table, but in this
case it fails:

alvherre=# drop table pk cascade;
NOTICE:  drop cascades to constraint fk_a_fkey on table fk
ERROR:  removing partition "pk1" violates foreign key constraint "fk_a_fkey1"
DETALLE:  Key (a)=(1) still referenced from table "fk".

The reason is that the "pre drop check" that's done before allow a drop
of the partition doesn't realize that the constraint is also being
dropped (and so the check ought to be skipped).  If you were to do just
"DROP TABLE pk1" then this complaint would be correct, since it would
leave the constraint in an invalid state.  But here, it's bogus and
annoying.  You can DELETE the matching values from table FK and then the
DROP works.  Here's another problem caused by the same misbehavior:

alvherre=# drop table pk, fk;
ERROR:  removing partition "pk1" violates foreign key constraint "fk_a_fkey1"
DETALLE:  Key (a)=(1) still referenced from table "fk".

Note here we want to get rid of table 'fk' completely.  If you split it
up in a DROP of fk, followed by a DROP of pk, it works.

I'm not yet sure what's the best way to attack this.  Maybe the
"pre-drop check" needs a completely different approach.  The simplest
approach is to prohibit a table drop or detach for any partitioned table
that's referenced by a foreign key, but that seems obnoxious and
inconvenient.


I still haven't put back the code in "#if 0".

FWIW I think we should add an index on pg_constraint.confrelid now.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>From 26ee4ece2435ef65b36c866dcc38a527042136e2 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Wed, 28 Nov 2018 11:52:00 -0300
Subject: [PATCH v6 1/3] Rework deleteObjectsInList to allow objtype-specific
 checks

This doesn't change any functionality yet.
---
 src/backend/catalo

Re: [HACKERS] Custom compression methods

2019-03-18 Thread Tomas Vondra



On 3/15/19 12:52 PM, Ildus Kurbangaliev wrote:
> On Fri, 15 Mar 2019 14:07:14 +0400
> David Steele  wrote:
> 
>> On 3/7/19 11:50 AM, Alexander Korotkov wrote:
>>> On Thu, Mar 7, 2019 at 10:43 AM David Steele >> > wrote:
>>>
>>> On 2/28/19 5:44 PM, Ildus Kurbangaliev wrote:
>>>
>>>  > there are another set of patches.
>>>  > Only rebased to current master.
>>>  >
>>>  > Also I will change status on commitfest to 'Needs review'.
>>>
>>> This patch has seen periodic rebases but no code review that I
>>> can see since last January 2018.
>>>
>>> As Andres noted in [1], I think that we need to decide if this
>>> is a feature that we want rather than just continuing to push it
>>> from CF to CF.
>>>
>>>
>>> Yes.  I took a look at code of this patch.  I think it's in pretty
>>> good shape.  But high level review/discussion is required.
>>
>> OK, but I think this patch can only be pushed one more time, maximum, 
>> before it should be rejected.
>>
>> Regards,
> 
> Hi,
> in my opinion this patch is usually skipped not because it is not
> needed, but because of its size. It is not hard to maintain it until
> committers have time for it, or until I get an actual response that
> nobody is going to commit it.
> 

That may be one of the reasons, yes. But there are other reasons, which
I think may be playing a bigger role.

There's one practical issue with how the patch is structured - the docs
and tests are in separate patches towards the end of the patch series,
which makes it impossible to commit the preceding parts. This needs to
change. Otherwise the patch size kills the patch as a whole.

But there's a more important cost/benefit issue, I think. When I look at
patches as a committer, I naturally have to weight how much time I spend
on getting it in (and then dealing with fallout from bugs etc) vs. what
I get in return (measured in benefits for community, users). This patch
is pretty large and complex, so the "costs" are quite high, while the
benefits from the patch itself is the ability to pick between pg_lz and
zlib. Which is not great, and so people tend to pick other patches.

Now, I understand there's a lot of potential benefits further down the
line, like column-level compression (which I think is the main goal
here). But that's not included in the patch, so the gains are somewhat
far in the future.

But hey, I think there are committers working for postgrespro, who might
have the motivation to get this over the line. Of course, assuming that
there are no serious objections to having this functionality or how it's
implemented ... But I don't think that was the case.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Progress reporting for pg_verify_checksums

2019-03-18 Thread Fabien COELHO



I have rebased it now.


Thanks. Will look at it.


If all of the above is included, the line would look as
follows.

[=== ] ( 63% of 12.53 GB, 179 MB/s, ETC 26s)

# Note that this is just an opinion.

(pg_checksum runs fast at the beginning, so the ETC behaves somewhat
strangely for a while.)


I haven't changed that for now as it seems to be a bit more involved.
I'd like to hear other opinions on whether that is worthwhile?


I think that the bar is overkill, but ETC is easy and nice.
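For what it's worth, the ETC estimate could be as simple as the following
sketch (illustrative only): extrapolate the remaining time from the elapsed
time and the fraction already done.

/* Returns estimated seconds remaining; 0 if nothing meaningful can be said. */
static double
estimate_seconds_remaining(double elapsed_secs, long done_bytes, long total_bytes)
{
    if (done_bytes <= 0 || total_bytes <= done_bytes)
        return 0.0;
    return elapsed_secs * (double) (total_bytes - done_bytes) / (double) done_bytes;
}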


+ /* we handle SIGUSR1 only, and toggle the value of show_progress */
+ if (signum == SIGUSR1)
+ show_progress = !show_progress;


SIGUSR1 *toggles* progress.


Not sure what you mean here,


Probably it is meant to simplify the comment?

--
Fabien.

Re: What to name the current heap after pluggable storage / what to rename?

2019-03-18 Thread Andres Freund
Hi,

On 2019-03-13 08:29:47 -0400, Robert Haas wrote:
> On Tue, Mar 12, 2019 at 8:39 PM Andres Freund  wrote:
> > > I like that option.
> >
> > In that vein, does anybody have an opinion about the naming of
> > a) HeapUpdateFailureData, which will be used for different AMs
> > b) HTSU_Result itself, which'll be the return parameter for
> >update/delete via tableam
> > c) Naming of HTSU_Result members (like HeapTupleBeingUpdated)
> >
> > I can see us doing several things:
> > 1) Live with the old names, explain the naming as historical baggage
> > 2) Replace names, but add typedefs / #defines for backward compatibility
> > 3) Rename without backward compatibility
> 
> I think I have a mild preference for renaming HeapUpdateFailureData to
> TableUpdateFailureData, but I'm less excited about renaming
> HTSU_Result or its members.  I don't care if you want to do
> s/HeapTuple/TableTuple/g and s/HTSU_Result/TTSU_Result/g though.

Anybody else got an opinion on those? I personally don't have strong
feelings on this.  I'm mildly inclined toward the renamings suggested
above.


> > If we were to go with 2) or 3), does anybody want to make a case for
> > renaming the HTSU_Result members? They've been confusing people for a
> > long while...
> >
> > In the patch it's currently:
> >
> > typedef enum
> > {
> > HeapTupleMayBeUpdated,  /* or deleted */
> > HeapTupleInvisible,
> > HeapTupleSelfUpdated,   /* or deleted */
> > HeapTupleUpdated,
> > HeapTupleDeleted,
> > HeapTupleBeingUpdated,  /* or deleted */
> > HeapTupleWouldBlock /* can be returned by 
> > heap_tuple_lock */
> > } HTSU_Result;
> 
> I think you're getting at the idea that HeapTupleMayBeUpdated really
> means NoProblemsFound, but I would be inclined to leave that alone.
> It's confusing, but the people who need to know what it means probably
> have the current name figured out, and would have to learn what some
> new name means.

That, and that HeapTupleMayBeUpdated, HeapTupleBeingUpdated also can
mean that the tuple is actually being deleted, not updated.  The patch
currently adds HeapTupleDeleted (which is currently subsumed in
HeapTupleUpdated), to allow callsites to distinguish between those two
cases - but we don't need callsites to distinguish between
HeapTupleMayBeUpdated / Deleted (there's no meaningful difference imo),
and HeapTupleBeingUpdated / Deleted currently also isn't necessary,
although there's certainly a meaningful difference.

Greetings,

Andres Freund



Re: Rare SSL failures on eelpout

2019-03-18 Thread Thomas Munro
On Tue, Mar 19, 2019 at 9:11 AM Tom Lane  wrote:
> I wrote:
> > ... I don't like pqHandleSendFailure all that much: it has strong
> > constraints on what state libpq has to be in, as a consequence of which
> > it's called from a bunch of ad-hoc places, and can't be called from
> > some others.  It's kind of accidental that it will work here.
> > I was toying with the idea of getting rid of that function in
> > favor of a design like this:
> > * Handle any socket-write failure at some low level of libpq by
> > recording the fact that the error occurred (plus whatever data we
> > need to produce a message about it), and then starting to just
> > discard output data.
> > * Eventually, we'll try to read.  Process any available input data
> > as usual (and, if we get a read error, report that).  When no more
> > input data is available, if a socket write failure has been recorded,
> > report that much as if it were an incoming ERROR message, and then
> > shut down the connection.
> > This would essentially give us pqHandleSendFailure-like behavior
> > across the board, which might be enough to fix this problem as well as
> > bug #15598.  Or not ... as you say, we haven't thoroughly understood
> > either issue, so it's possible this wouldn't do it.
>
> Here's a draft patch along that line.  It's gratifyingly small, and
> it does fix the SSL problem for me.  I have no way of investigating
> whether it fixes bug #15598, but it might.
>
> Aside from the SSL cert-verify issue, I've checked the behavior when
> the backend is shut down ("pg_ctl stop") between queries, and when
> the backend is kill 9'd mid-query.  The shutdown case reacts well with
> or without SSL:
>
> regression=# select 2+2;
> FATAL:  terminating connection due to administrator command
> server closed the connection unexpectedly
> This probably means the server terminated abnormally
> before or while processing the request.
> The connection to the server was lost. Attempting reset: Failed.
>
> The backend-crash case is fine without SSL:
>
> regression=# select * from tenk1 t1, tenk1 t2, tenk1 t3;
> server closed the connection unexpectedly
> This probably means the server terminated abnormally
> before or while processing the request.
> The connection to the server was lost. Attempting reset: Succeeded.
>
> but a bit less nice with it:
>
> regression=# select * from tenk1 t1, tenk1 t2, tenk1 t3;
> SSL SYSCALL error: EOF detected
> The connection to the server was lost. Attempting reset: Succeeded.
>
> That seems cosmetic --- we just haven't expended the same effort on
> what to print for connection EOF with SSL as without it.  (Also,
> you get that same result without this patch.)
>
> Of course, this hardly counts as exhaustive testing, especially since
> I only tried one OpenSSL version.

Neat.  Fixes the problem for me on eelpout's host.

> > One thing that isn't real clear to me is how much timing sensitivity
> > there is in "when no more input data is available".  Can we assume that
> > if we've gotten ECONNRESET or an allied error from a write, then any
> > data the far end might've wished to send us is already available to
> > read?

Following a trail beginning at
https://en.wikipedia.org/wiki/Transmission_Control_Protocol I see that
RFC 1122 4.2.2.13 discusses this topic and possible variations in this
area.  I don't know enough about any of this stuff to draw hard
conclusions from primary sources, but it does appear that an
implementation might be within its rights to jettison that data,
unfortunately.  As seen with the simple Python script experiment
up-thread, we tested four implementations, and (amusingly) got four
different behaviours:

1.  For Windows, the answer to your question was clearly "no" in that
experiment.  I think this behaviour sucks for any protocol that
involves unsolicited messages in both directions (that is, not based
on pure request/response processing), because there is no way to
guarantee that you receive a "goodbye" message, and that's what's
happening in bug #15598.  I'm not sure if there is anything we ever
can do about this though, other than complaining on the internet or
opening a second TCP connection for unsolicited server->client
messages.

2.  Linux, FreeBSD and Darwin gave slightly different error sequences
when writing after the remote connection was closed (though I suspect
they'd behave much the same way for a connection to a remote host),
but all allowed the "goodbye" message to be read, so the answer there
is "yes".

> >  In principle, since TCP allows shutting down either transmission
> > direction independently, the server could close the read side of its
> > socket while continuing to transmit data.  In practice, PG servers
> > don't do that; but could network timing glitches create the same effect?
> > Even if it's possible, does it happen enough to worry about?
> > The reason I'm concerned is that I don't think it'd be bright to ignore a
> > send error until we se

Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-18 Thread Peter Geoghegan
On Tue, Mar 12, 2019 at 11:40 AM Robert Haas  wrote:
> I think it's pretty clear that we have to view that as acceptable.  I
> mean, we could reduce contention even further by finding a way to make
> indexes 40% larger, but I think it's clear that nobody wants that.
> Now, maybe in the future we'll want to work on other techniques for
> reducing contention, but I don't think we should make that the problem
> of this patch, especially because the regressions are small and go
> away after a few hours of heavy use.  We should optimize for the case
> where the user intends to keep the database around for years, not
> hours.

I came back to the question of contention recently. I don't think it's
okay to make contention worse in workloads where indexes are mostly
the same size as before, such as almost any workload that pgbench can
simulate. I have made a lot of the fact that the TPC-C indexes are
~40% smaller, in part because lots of people outside the community
find TPC-C interesting, and in part because this patch series is
focused on cases where we currently do unusually badly (cases where
good intuitions about how B-Trees are supposed to perform break down
[1]). These pinpointable problems must affect a lot of users some of
the time, but certainly not all users all of the time.

The patch series is actually supposed to *improve* the situation with
index buffer lock contention in general, and it looks like it manages
to do that with pgbench, which doesn't do inserts into indexes, except
for those required for non-HOT updates. pgbench requires relatively
few page splits, but is in every other sense a high contention
workload.

With pgbench scale factor 20, here are results for patch and master
with a Gaussian distribution on my 8 thread/4 core home server, with
each run reported lasting 10 minutes, repeating twice for client
counts 1, 2, 8, 16, and 64, patch and master branch:

\set aid random_gaussian(1, 10 * :scale, 20)
\set bid random(1, 1 * :scale)
\set tid random(1, 10 * :scale)
\set delta random(-5000, 5000)
BEGIN;
UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES
(:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
END;

1st pass


(init pgbench from scratch for each database, scale 20)

1 client master:
tps = 7203.983289 (including connections establishing)
tps = 7204.020457 (excluding connections establishing)
latency average = 0.139 ms
latency stddev = 0.026 ms
1 client patch:
tps = 7012.575167 (including connections establishing)
tps = 7012.590007 (excluding connections establishing)
latency average = 0.143 ms
latency stddev = 0.020 ms

2 clients master:
tps = 13434.043832 (including connections establishing)
tps = 13434.076194 (excluding connections establishing)
latency average = 0.149 ms
latency stddev = 0.032 ms
2 clients patch:
tps = 13105.620223 (including connections establishing)
tps = 13105.654109 (excluding connections establishing)
latency average = 0.153 ms
latency stddev = 0.033 ms

8 clients master:
tps = 27126.852038 (including connections establishing)
tps = 27126.986978 (excluding connections establishing)
latency average = 0.295 ms
latency stddev = 0.095 ms
8 clients patch:
tps = 27945.457965 (including connections establishing)
tps = 27945.565242 (excluding connections establishing)
latency average = 0.286 ms
latency stddev = 0.089 ms

16 clients master:
tps = 32297.612323 (including connections establishing)
tps = 32297.743929 (excluding connections establishing)
latency average = 0.495 ms
latency stddev = 0.185 ms
16 clients patch:
tps = 33434.889405 (including connections establishing)
tps = 33435.021738 (excluding connections establishing)
latency average = 0.478 ms
latency stddev = 0.167 ms

64 clients master:
tps = 25699.029787 (including connections establishing)
tps = 25699.217022 (excluding connections establishing)
latency average = 2.482 ms
latency stddev = 1.715 ms
64 clients patch:
tps = 26513.816673 (including connections establishing)
tps = 26514.013638 (excluding connections establishing)
latency average = 2.405 ms
latency stddev = 1.690 ms

2nd pass


(init pgbench from scratch for each database, scale 20)

1 client master:
tps = 7172.995796 (including connections establishing)
tps = 7173.013472 (excluding connections establishing)
latency average = 0.139 ms
latency stddev = 0.022 ms
1 client patch:
tps = 7024.724365 (including connections establishing)
tps = 7024.739237 (excluding connections establishing)
latency average = 0.142 ms
latency stddev = 0.021 ms

2 clients master:
tps = 13489.016303 (including connections establishing)
tps = 13489.047968 (excluding connections establishing)
latency average = 0.148 ms
latency stddev = 0.032 ms
2 clients patch:
tps

pg_upgrade version checking questions

2019-03-18 Thread Tom Lane
While poking around trying to find an explanation for the pg_upgrade
failure described here:
https://www.postgresql.org/message-id/flat/CACmJi2JUhGo2ZxqDkh-EPHNjEN1ZA1S64uHLJFWHBhUuV4492w%40mail.gmail.com
I noticed a few things that seem a bit fishy about pg_upgrade.
I can't (yet) connect any of these to Tomasz' problem, but:

1. check_bin_dir() does validate_exec() for pg_dumpall and pg_dump,
but not for pg_restore, though pg_upgrade surely calls that too.
For that matter, it's not validating initdb and vacuumdb, though
it's grown dependencies on those as well.  Seems like there's little
point in checking these if we're not going to check all of them.

2. check_cluster_versions() insists that the target version be the
same major version as pg_upgrade itself, but is that really good enough?
As things stand, it looks like pg_upgrade 11.3 would happily use pg_dump
11.1, or vice versa.  With this rule, we cannot safely make any fixes
in minor releases that rely on synchronized changes in the behavior of
pg_upgrade and pg_dump/pg_dumpall/pg_restore.  I've not gone looking
to see if we've already made such changes in the past, but even if we
never have, that's a rather tight-looking straitjacket.  I think we
should insist that the new_cluster.bin_version be an exact match
to pg_upgrade's own PG_VERSION_NUM.

3. Actually, I'm kind of wondering why pg_upgrade has a --new-bindir
option at all, rather than just insisting on finding the new-version
executables in the same directory it is in.  This seems like, at best,
a hangover from before it got into core.  Even if you don't want to
remove the option, we could surely provide a useful default setting
based on find_my_exec.  (I'm amused to notice that pg_upgrade
currently takes the trouble to find out its own path, and then does
precisely nothing with the information.)

Thoughts?

regards, tom lane



Re: Rare SSL failures on eelpout

2019-03-18 Thread Tom Lane
Thomas Munro  writes:
> On Tue, Mar 19, 2019 at 9:11 AM Tom Lane  wrote:
>> My current feeling is that this is OK to put in HEAD but I think the
>> risk-reward ratio isn't very good for the back branches.  Even with
>> an OpenSSL version where this makes a difference, the problematic
>> behavior is pretty hard to hit.  So I'm a bit inclined to do nothing
>> in the back branches.

> Shouldn't we also back-patch the one-line change adding
> pqHandleSendFailure()?

As I said before, I don't like that patch: at best it's an abuse of
pqHandleSendFailure, because that function is only meant to be called
at start of a query cycle.  It wouldn't be hard to break this usage and
not notice, especially given that we often don't test back-patched
changes very carefully in the back branches if they seem OK in HEAD.

Possibly we could consider back-patching the more aggressive patch
once it's survived v12 beta testing, and just living with the issue
till then.  Given what we know now, I don't think this is a big
problem for the field: how many people use SSL on local connections?

regards, tom lane



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-18 Thread Robert Haas
On Mon, Mar 18, 2019 at 7:34 PM Peter Geoghegan  wrote:
> With pgbench scale factor 20, here are results for patch and master
> with a Gaussian distribution on my 8 thread/4 core home server, with
> each run reported lasting 10 minutes, repeating twice for client
> counts 1, 2, 8, 16, and 64, patch and master branch:
>
> 1 client master:
> tps = 7203.983289 (including connections establishing)
> 1 client patch:
> tps = 7012.575167 (including connections establishing)
>
> 2 clients master:
> tps = 13434.043832 (including connections establishing)
> 2 clients patch:
> tps = 13105.620223 (including connections establishing)

Blech.  I think the patch has enough other advantages that it's worth
accepting that, but it's not great.  We seem to keep finding reasons
to reduce single client performance in the name of scalability, which
is often reasonable but not wonderful.

> However, this isn't completely
> free (particularly the page split stuff), and it doesn't pay for
> itself until the number of clients ramps up.

I don't really understand that explanation.  It makes sense that more
intelligent page split decisions could require more CPU cycles, but it
is not evident to me why more clients would help better page split
decisions pay off.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [PATCH] Add support for ON UPDATE/DELETE actions on ALTER CONSTRAINT

2019-03-18 Thread Matheus de Oliveira
On Sun, Feb 3, 2019 at 8:28 AM Andres Freund  wrote:

>
> >
> > It compiled, worked as expected, but some tests broke executing make
> check:
> >
> > test create_table ... FAILED
> >  constraints  ... FAILED
> >  inherit  ... FAILED
> >  foreign_data ... FAILED
> >  alter_table  ... FAILED
> >
> > I didn't test the bugfix, just the v3 patch.
>
> Given that the patch hasn't been updated since, and the CF has ended,
> I'm marking this patch as returned with feedback. Please resubmit once
> that's fixed.
>
>
Hi all.

Sorry for the long delay. I've rebased the patch to current master (at
f2004f19ed now), attached as postgresql-alter-constraint.v4.patch. All
tests passed cleanly.

Best regards,
-- 
Matheus de Oliveira
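
For anyone skimming the thread, a quick usage sketch of the syntax the patch
adds (the table and constraint names below are made up):

-- change only the referential actions of an existing FK constraint
ALTER TABLE orders
    ALTER CONSTRAINT orders_customer_id_fkey
    ON DELETE CASCADE ON UPDATE RESTRICT;

-- actions and deferrability can be changed in one statement, per the new synopsis
ALTER TABLE orders
    ALTER CONSTRAINT orders_customer_id_fkey
    ON DELETE SET NULL
    DEFERRABLE INITIALLY DEFERRED;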
diff --git a/doc/src/sgml/ref/alter_table.sgml b/doc/src/sgml/ref/alter_table.sgml
index e360728c02..b2866f467a 100644
--- a/doc/src/sgml/ref/alter_table.sgml
+++ b/doc/src/sgml/ref/alter_table.sgml
@@ -55,7 +55,9 @@ ALTER TABLE [ IF EXISTS ] name
 ALTER [ COLUMN ] column_name SET STORAGE { PLAIN | EXTERNAL | EXTENDED | MAIN }
 ADD table_constraint [ NOT VALID ]
 ADD table_constraint_using_index
-ALTER CONSTRAINT constraint_name [ DEFERRABLE | NOT DEFERRABLE ] [ INITIALLY DEFERRED | INITIALLY IMMEDIATE ]
+ALTER CONSTRAINT constraint_name
+  [ ON DELETE action ] [ ON UPDATE action ]
+  [ DEFERRABLE | NOT DEFERRABLE ] [ INITIALLY DEFERRED | INITIALLY IMMEDIATE ]
 VALIDATE CONSTRAINT constraint_name
 DROP CONSTRAINT [ IF EXISTS ]  constraint_name [ RESTRICT | CASCADE ]
 DISABLE TRIGGER [ trigger_name | ALL | USER ]
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 515c29072c..56847b90f9 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -8305,8 +8305,43 @@ ATExecAlterConstraint(Relation rel, AlterTableCmd *cmd,
  errmsg("constraint \"%s\" of relation \"%s\" is not a foreign key constraint",
 		cmdcon->conname, RelationGetRelationName(rel))));
 
+	/*
+	 * Check for FKCONSTR_ACTION_UNKNOWN; if found, replace it with the
+	 * current action. We could handle FKCONSTR_ACTION_UNKNOWN below, but
+	 * since we already have to handle the case of changing to the same
+	 * action, it seems simpler to simply replace FKCONSTR_ACTION_UNKNOWN
+	 * with the current action here.
+	 */
+	if (cmdcon->fk_del_action == FKCONSTR_ACTION_UNKNOWN)
+		cmdcon->fk_del_action = currcon->confdeltype;
+
+	if (cmdcon->fk_upd_action == FKCONSTR_ACTION_UNKNOWN)
+		cmdcon->fk_upd_action = currcon->confupdtype;
+
+	/*
+	 * Do the same for the deferrability attributes. But note that if only
+	 * initdeferred was changed, and to true, we must force deferrable to
+	 * true as well. Conversely, if only deferrable was changed, and to
+	 * false, we must force initdeferred to false as well.
+	 */
+	if (!cmdcon->was_deferrable_set)
+		cmdcon->deferrable = cmdcon->initdeferred ? true : currcon->condeferrable;
+
+	if (!cmdcon->was_initdeferred_set)
+		cmdcon->initdeferred = !cmdcon->deferrable ? false : currcon->condeferred;
+
+	/*
+	 * This is only a safety check; this error should never happen here.
+	 */
+	if (cmdcon->initdeferred && !cmdcon->deferrable)
+		ereport(ERROR,
+(errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("constraint declared INITIALLY DEFERRED must be DEFERRABLE")));
+
 	if (currcon->condeferrable != cmdcon->deferrable ||
-		currcon->condeferred != cmdcon->initdeferred)
+		currcon->condeferred != cmdcon->initdeferred ||
+		currcon->confdeltype != cmdcon->fk_del_action ||
+		currcon->confupdtype != cmdcon->fk_upd_action)
 	{
 		HeapTuple	copyTuple;
 		HeapTuple	tgtuple;
@@ -8324,6 +8359,8 @@ ATExecAlterConstraint(Relation rel, AlterTableCmd *cmd,
 		copy_con = (Form_pg_constraint) GETSTRUCT(copyTuple);
 		copy_con->condeferrable = cmdcon->deferrable;
 		copy_con->condeferred = cmdcon->initdeferred;
+		copy_con->confdeltype = cmdcon->fk_del_action;
+		copy_con->confupdtype = cmdcon->fk_upd_action;
 		CatalogTupleUpdate(conrel, &copyTuple->t_self, copyTuple);
 
 		InvokeObjectPostAlterHook(ConstraintRelationId,
@@ -8360,23 +8397,106 @@ ATExecAlterConstraint(Relation rel, AlterTableCmd *cmd,
 otherrelids = list_append_unique_oid(otherrelids,
 	 tgform->tgrelid);
 
-			/*
-			 * Update deferrability of RI_FKey_noaction_del,
-			 * RI_FKey_noaction_upd, RI_FKey_check_ins and RI_FKey_check_upd
-			 * triggers, but not others; see createForeignKeyTriggers and
-			 * CreateFKCheckTrigger.
-			 */
-			if (tgform->tgfoid != F_RI_FKEY_NOACTION_DEL &&
-tgform->tgfoid != F_RI_FKEY_NOACTION_UPD &&
-tgform->tgfoid != F_RI_FKEY_CHECK_INS &&
-tgform->tgfoid != F_RI_FKEY_CHECK_UPD)
-continue;
-
 			copyTuple = heap_copytuple(tgtuple);
 			copy_tg = (Form_pg_trigger) GETSTRUCT(copyTuple);
 
+			/*
+			 * Set deferrability here, but note that it may be overridden below
+			 * if the pg_trigger entry is on the referencing tabl

Re: Progress reporting for pg_verify_checksums

2019-03-18 Thread Kyotaro HORIGUCHI
Hello.

At Mon, 18 Mar 2019 23:14:01 +0100 (CET), Fabien COELHO  
wrote in 
> 
> > I have rebased it now.
> 
> Thanks. Will look at it.
> 
> >> If the all of aboves are involved, the line would look as the
> >> follows.
> >>
> >> [=== ] ( 63% of 12.53 GB, 179 MB/s,
> >> ETC 26s)
> >>
> >> # Note that this is just an opinion.
> >>
> >> (pg_checksum runs fast at the beginning so ETC behaves somewhat
> >> strange in the meanwhile.)
> >
> > I haven't changed that for now as it seems to be a bit more involved.
> > I'd like to hear other opinions on whether that is worthwhile?
> I think that the bar is overkill, but ETC is easy and nice.

I'm fine with that.

> >>> + /* we handle SIGUSR1 only, and toggle the value of show_progress
> >>> */
> >>> + if (signum == SIGUSR1)
> >>> + show_progress = !show_progress;
> >>
> >> SIGUSR1 *toggles* progress.
> >
> > Not sure what you mean here,
> 
> Probably it is meant to simplify the comment?

Sorry. I meant: "it can be turned off, but then a perhaps-garbage progress
line is left behind after turning it off. Don't we need to erase it?"

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-18 Thread Peter Geoghegan
On Mon, Mar 18, 2019 at 5:00 PM Robert Haas  wrote:
> Blech.  I think the patch has enough other advantages that it's worth
> accepting that, but it's not great.  We seem to keep finding reasons
> to reduce single client performance in the name of scalability, which
> is often reasonable but not wonderful.

The good news is that the quicksort that we now perform in
nbtsplitloc.c is not optimized at all. Heikki thought it premature to
optimize that, for example by inlining/specializing the quicksort. I
can make that 3x faster fairly easily, which could easily change the
picture here. The code will be uglier that way, but not much more
complicated. I even prototyped this, and managed to make serial
microbenchmarks I've used noticeably faster. This could very well fix
the problem here. It clearly showed up in perf profiles with serial
bulks loads.

> > However, this isn't completely
> > free (particularly the page split stuff), and it doesn't pay for
> > itself until the number of clients ramps up.
>
> I don't really understand that explanation.  It makes sense that more
> intelligent page split decisions could require more CPU cycles, but it
> is not evident to me why more clients would help better page split
> decisions pay off.

Smarter choices on page splits pay off with higher client counts
because they reduce contention at likely hot points. It's kind of
crazy that the code in _bt_check_unique() sometimes has to move right,
while holding an exclusive buffer lock on the original page and a
shared buffer lock on its sibling page at the same time. It then has
to hold a third buffer lock concurrently, this time on any heap pages
it is interested in. Each in turn, to check if they're possibly
conflicting. gcov shows that that never happens with the regression
tests once the patch is applied (you can at least get away with only
having one buffer lock on a leaf page at all times in practically all
cases).

-- 
Peter Geoghegan



Re: Rare SSL failures on eelpout

2019-03-18 Thread Thomas Munro
On Tue, Mar 19, 2019 at 12:25 PM Thomas Munro  wrote:
> 2.  Linux, FreeBSD and Darwin gave slightly different error sequences
> when writing after the remote connection was closed (though I suspect
> they'd behave much the same way for a connection to a remote host),
> but all allowed the "goodbye" message to be read, so the answer there
> is "yes".

Qualification, after some off-list discussion with Andrew Gierth: even
though that appears to work in happy tests on those TCP
implementations, it may be that they can't really guarantee that it
works in adverse conditions, because the server may not be able to
retransmit packets that preceded the reset but somehow went missing,
and it's also not clear what happens if they arrive out of order.
Dunno.  (Oh man, all this hidden magic, is it actually possible to
write working application code without imbibing Stevens TCP/IP
Illustrated?  It's a bit like last year's work on filesystem quirks,
when I realised that man pages are nice and all, but you basically
need to read the source or talk to someone who has...)

-- 
Thomas Munro
https://enterprisedb.com



Re: Rare SSL failures on eelpout

2019-03-18 Thread Thomas Munro
On Tue, Mar 19, 2019 at 12:44 PM Tom Lane  wrote:
> Thomas Munro  writes:
> > Shouldn't we also back-patch the one-line change adding
> > pqHandleSendFailure()?
>
> As I said before, I don't like that patch: at best it's an abuse of
> pqHandleSendFailure, because that function is only meant to be called
> at start of a query cycle.  It wouldn't be hard to break this usage and
> not notice, especially given that we often don't test back-patched
> changes very carefully in the back branches if they seem OK in HEAD.
>
> Possibly we could consider back-patching the more aggressive patch
> once it's survived v12 beta testing, and just living with the issue
> till then.  Given what we know now, I don't think this is a big
> problem for the field: how many people use SSL on local connections?

Yeah, now that we understand this properly I agree this is unlikely to
bother anyone in real life.  I just want to make the build farm green.
I wondered about ssl_max_protocol_version = 'TLSv1.2', but that GUC's
too new.  Another option would be to change the "command_fails_like"
pattern to tolerate both errors in v11.

-- 
Thomas Munro
https://enterprisedb.com



Re: [HACKERS] Block level parallel vacuum

2019-03-18 Thread Masahiko Sawada
On Tue, Mar 19, 2019 at 3:05 AM Robert Haas  wrote:
>
> On Thu, Mar 14, 2019 at 3:37 AM Masahiko Sawada  wrote:
> > BTW your patch seems to not apply to the current HEAD cleanly and to
> > need to update the comment of vacuum().
>
> Yeah, I omitted some hunks by being stupid with 'git'.
>
> Since you seem to like the approach, I put back the hunks I intended
> to have there, pulled in one change from your v2 that looked good,
> made one other tweak, and committed this.

Thank you!

>   I think I like what I did
> with vacuum_open_relation a bit better than what you did; actually, I
> think it cannot be right to just pass 'params' when the current code
> is passing params->options & ~(VACOPT_VACUUM).  My approach avoids
> that particular pitfall.

Agreed. Thanks.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: Rare SSL failures on eelpout

2019-03-18 Thread Tom Lane
Thomas Munro  writes:
> Yeah, now that we understand this properly I agree this is unlikely to
> bother anyone in real life.  I just want to make the build farm green.
> I wondered about ssl_max_protocol_version = 'TLSv1.2', but that GUC's
> too new.

Yeah; also, forcing that would reduce our test coverage.

> Another option would be to change the "command_fails_like"
> pattern to tolerate both errors in v11.

This might be a reasonable stopgap measure, since it'd basically
only affect buildfarm results.

regards, tom lane



RE: speeding up planning with partitions

2019-03-18 Thread Imai, Yoshikazu
Amit-san,

On Mon, Mar 18, 2019 at 9:56 AM, Amit Langote wrote:
> On 2019/03/15 14:40, Imai, Yoshikazu wrote:
> > Amit-san,
> >
> > I have another little comments about v31-patches.
> >
> > * We don't need is_first_child in inheritance_planner().
> >
> > On Fri, Mar 8, 2019 at 9:18 AM, Amit Langote wrote:
> >> On 2019/03/08 16:16, Imai, Yoshikazu wrote:
> >>> I attached the diff of modification for v26-0003 patch which also
> >> contains some refactoring.
> >>> Please see if it is ok.
> >>
> >> I like the is_first_child variable which somewhat improves
> >> readability, so updated the patch to use it.
> >
> > I noticed that detecting first child query in inheritance_planner()
> is already done by "final_rtable == NIL"
> >
> > /*
> >  * If this is the first non-excluded child, its post-planning
> rtable
> >  * becomes the initial contents of final_rtable; otherwise,
> append
> >  * just its modified subquery RTEs to final_rtable.
> >  */
> > if (final_rtable == NIL)
> >
> > So I think we can use that instead of using is_first_child.
> 
> That's a good suggestion.  One problem I see with that is that
> final_rtable is set only once we've found the first *non-dummy* child.

Ah... I overlooked that.

> So, if all the children except the last one are dummy, we'd end up never
> reusing the source child objects.  Now, if final_rtable was set for the
> first child irrespective of whether it turned out to be dummy or not,
> which it seems OK to do, then we can implement your suggestion.

I thought you meant we could remove the code below, or move it to after the
point where final_rtable is set.

/*
 * If this child rel was excluded by constraint exclusion, exclude it
 * from the result plan.
 */
if (IS_DUMMY_REL(sub_final_rel))
continue;

It seems logically OK, but I also think there are cases where useless
processing would happen.

If we execute the update statement below, final_rtable would be unnecessarily
expanded by adding placeholders (a minimal setup sketch follows the example).

* partition table rt with 1024 partitions.
* partition table pt with 2 partitions.
* update rt set c = ss.c from ( select b from pt union all select b + 3 from 
pt) ss where a = 1024 and rt.b = ss.b;
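
A minimal setup sketch for the tables above (the partition strategy, bounds,
and column lists are my assumptions, only meant to make the shape of the case
concrete):

create table rt (a int, b int, c int) partition by range (a);
select format('create table rt%s partition of rt for values from (%s) to (%s);',
              x, x, x + 1)
from generate_series(1, 1024) x
\gexec
create table pt (b int, c int) partition by range (b);
create table pt1 partition of pt for values from (0) to (1000);
create table pt2 partition of pt for values from (1000) to (2000);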


After all, I think it might be better to use is_first_child, because
is_first_child and final_rtable mean different things: is_first_child == true
means we are currently processing the first child query, while
final_rtable == NIL means we have not yet found the first non-excluded child
query.

--
Yoshikazu Imai


Re: Libpq support to connect to standby server as priority

2019-03-18 Thread Haribabu Kommi
On Thu, Feb 28, 2019 at 1:00 PM Tsunakawa, Takayuki <
tsunakawa.ta...@jp.fujitsu.com> wrote:

> From: Haribabu Kommi [mailto:kommi.harib...@gmail.com]
> > Attached are the updated patches.
>
> Thanks, all look fixed.
>
>
> > The target_server_type option yet to be implemented.
>
> Please let me review once more and proceed to testing when the above is
> added, to make sure the final code looks good.  I'd like to see how complex
> the if conditions in multiple places would be after adding
> target_server_type, and consider whether we can simplify them together with
> you.  Even now, the if conditions seem complicated to me... that's probably
> due to the existence of read_write_host_index.
>

Yes, the if checks are a little complex because of the additional checks
needed to identify the server type. I will check whether there is an easier
way to write them without introducing code duplication.

While working on the implementation of the new target_server_type connection
option for libpq (to specify master, slave, etc.), I found that there is no
problem when the newly added target_server_type option is used on its own, but
when it is combined with the existing target_session_attrs, some combinations
may be invalid, or no server may exist that satisfies them.

Target_session_attrs    Target_server_type

read-write              prefer-slave, slave
prefer-read             master, slave
read-only               master, prefer-slave

I know that some of the cases above are possible, for example a master server
that by default accepts read-only sessions. Instead of adding a check to
validate which combinations are valid, how about allowing all combinations
and, when a combination cannot be satisfied (meaning there is no server that
supports the requirement), simply letting the connection fail?

comments?

Also, since we need to support the new option when connecting to servers
older than 12 as well, should this option send the command
"select pg_is_in_recovery()" to the server to find out whether the server is
in recovery mode or not?

Also, from an implementation point of view, the new target_server_type option
is validated separately, meaning the check for the required server is handled
in a separate switch case. When both options are given, should we first find
the required server for target_session_attrs and then validate it again
against target_server_type?

Regards,
Haribabu Kommi
Fujitsu Australia


Re: [HACKERS] Block level parallel vacuum

2019-03-18 Thread Haribabu Kommi
On Mon, Mar 18, 2019 at 1:58 PM Masahiko Sawada 
wrote:

> On Tue, Feb 26, 2019 at 7:20 PM Masahiko Sawada 
> wrote:
> >
> > On Tue, Feb 26, 2019 at 1:35 PM Haribabu Kommi 
> wrote:
> > >
> > > On Thu, Feb 14, 2019 at 9:17 PM Masahiko Sawada 
> wrote:
> > >>
> > >> Thank you. Attached the rebased patch.
> > >
> > >
> > > I ran some performance tests to compare the parallelism benefits,
> >
> > Thank you for testing!
> >
> > > but I got some strange results of performance overhead, maybe it is
> > > because I tested it on my laptop.
> >
> > Hmm, I think the parallel vacuum would help for heavy workloads like a
> > big table with multiple indexes. In your test result, all executions
> > are completed within 1 sec, which seems to be one use case that the
> > parallel vacuum wouldn't help. I suspect that the table is small,
> > right? Anyway I'll also do performance tests.
> >
>
> Here are the performance test results. I've set up a 500MB table with
> several indexes and made 10% of the table dirty before each vacuum.
> I compared execution time of the patched postgres with the current HEAD
> (see the 'speed_up' column). In my environment,
>
>  indexes | parallel_degree |  patched   |    head    | speed_up
> ---------+-----------------+------------+------------+----------
>   0 |   0 |   238.2085 |   244.7625 |   1.0275
>   0 |   1 |   237.7050 |   244.7625 |   1.0297
>   0 |   2 |   238.0390 |   244.7625 |   1.0282
>   0 |   4 |   238.1045 |   244.7625 |   1.0280
>   0 |   8 |   237.8995 |   244.7625 |   1.0288
>   0 |  16 |   237.7775 |   244.7625 |   1.0294
>   1 |   0 |  1328.8590 |  1334.9125 |   1.0046
>   1 |   1 |  1325.9140 |  1334.9125 |   1.0068
>   1 |   2 |  1333.3665 |  1334.9125 |   1.0012
>   1 |   4 |  1329.5205 |  1334.9125 |   1.0041
>   1 |   8 |  1334.2255 |  1334.9125 |   1.0005
>   1 |  16 |  1335.1510 |  1334.9125 |   0.9998
>   2 |   0 |  2426.2905 |  2427.5165 |   1.0005
>   2 |   1 |  1416.0595 |  2427.5165 |   1.7143
>   2 |   2 |  1411.6270 |  2427.5165 |   1.7197
>   2 |   4 |  1411.6490 |  2427.5165 |   1.7196
>   2 |   8 |  1410.1750 |  2427.5165 |   1.7214
>   2 |  16 |  1413.4985 |  2427.5165 |   1.7174
>   4 |   0 |  4622.5060 |  4619.0340 |   0.9992
>   4 |   1 |  2536.8435 |  4619.0340 |   1.8208
>   4 |   2 |  2548.3615 |  4619.0340 |   1.8126
>   4 |   4 |  1467.9655 |  4619.0340 |   3.1466
>   4 |   8 |  1486.3155 |  4619.0340 |   3.1077
>   4 |  16 |  1481.7150 |  4619.0340 |   3.1174
>   8 |   0 |  9039.3810 |  8990.4735 |   0.9946
>   8 |   1 |  4807.5880 |  8990.4735 |   1.8701
>   8 |   2 |  3786.7620 |  8990.4735 |   2.3742
>   8 |   4 |  2924.2205 |  8990.4735 |   3.0745
>   8 |   8 |  2684.2545 |  8990.4735 |   3.3493
>   8 |  16 |  2672.9800 |  8990.4735 |   3.3635
>  16 |   0 | 17821.4715 | 17740.1300 |   0.9954
>  16 |   1 |  9318.3810 | 17740.1300 |   1.9038
>  16 |   2 |  7260.6315 | 17740.1300 |   2.4433
>  16 |   4 |  5538.5225 | 17740.1300 |   3.2030
>  16 |   8 |  5368.5255 | 17740.1300 |   3.3045
>  16 |  16 |  5291.8510 | 17740.1300 |   3.3523
> (36 rows)
>

The performance results are good. Do we want to add a recommended table size
for the parallel option to the documentation? The parallel option can lead to
performance overhead for smaller tables.
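
For context, with this patch the option is specified roughly like this; the
exact spelling is my reading of the patch series, so treat it as an
assumption:

-- parallel_degree in the table above corresponds to the number after PARALLEL
VACUUM (PARALLEL 4) big_table;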

Regards,
Haribabu Kommi
Fujitsu Australia


Re: [HACKERS] CLUSTER command progress monitor

2019-03-18 Thread Tatsuro Yamada

Hi Rafia!

On 2019/03/18 20:42, Rafia Sabih wrote:

On Fri, 8 Mar 2019 at 09:14, Tatsuro Yamada
 wrote:

Attached file is rebased patch on current HEAD.
I changed a status. :)



Looks like the patch needs a rebase.
I was on the commit fb5806533f9fe0433290d84c9b019399cd69e9c2

PFA reject file in case you want to have a look.


Thanks for testing it. :)
I rebased the patch on the current head: 
f2004f19ed9c9228d3ea2b12379ccb4b9212641f.

Please find attached file.

Also, I share my test case of progress monitor below.

=== My test case ===

[Terminal1]
Run this query on psql:

   \a \t
   select * from pg_stat_progress_cluster; \watch 0.05

[Terminal2]
Run these queries on psql:

drop table t1;

create table t1 as select a, random() * 1000 as b from generate_series(0, 
99) a;
create index idx_t1 on t1(a);
create index idx_t1_b on t1(b);
analyze t1;

-- index scan
set enable_seqscan to off;
cluster verbose t1 using idx_t1;

-- seq scan
set enable_seqscan to on;
set enable_indexscan to off;
cluster verbose t1 using idx_t1;

-- only given table name to cluster command
cluster verbose t1;

-- only cluster command
cluster verbose;

-- vacuum full
vacuum full t1;

-- vacuum full
vacuum full;




Regards,
Tatsuro Yamada









diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index c339a2bb77..8a634dd57e 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -52,6 +52,7 @@
 #include "catalog/storage.h"
 #include "commands/tablecmds.h"
 #include "commands/event_trigger.h"
+#include "commands/progress.h"
 #include "commands/trigger.h"
 #include "executor/executor.h"
 #include "miscadmin.h"
@@ -59,6 +60,7 @@
 #include "nodes/nodeFuncs.h"
 #include "optimizer/optimizer.h"
 #include "parser/parser.h"
+#include "pgstat.h"
 #include "rewrite/rewriteManip.h"
 #include "storage/bufmgr.h"
 #include "storage/lmgr.h"
@@ -3851,6 +3853,7 @@ reindex_relation(Oid relid, int flags, int options)
 	List	   *indexIds;
 	bool		is_pg_class;
 	bool		result;
+	int			i;
 
 	/*
 	 * Open and lock the relation.  ShareLock is sufficient since we only need
@@ -3938,6 +3941,7 @@ reindex_relation(Oid relid, int flags, int options)
 
 		/* Reindex all the indexes. */
 		doneIndexes = NIL;
+		i = 1;
 		foreach(indexId, indexIds)
 		{
 			Oid			indexOid = lfirst_oid(indexId);
@@ -3955,6 +3959,11 @@ reindex_relation(Oid relid, int flags, int options)
 
 			if (is_pg_class)
 doneIndexes = lappend_oid(doneIndexes, indexOid);
+
+			/* Set index rebuild count */
+			pgstat_progress_update_param(PROGRESS_CLUSTER_INDEX_REBUILD_COUNT,
+		 i);
+			i++;
 		}
 	}
 	PG_CATCH();
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index d962648bc5..87c0092787 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -907,6 +907,32 @@ CREATE VIEW pg_stat_progress_vacuum AS
 FROM pg_stat_get_progress_info('VACUUM') AS S
 		LEFT JOIN pg_database D ON S.datid = D.oid;
 
+CREATE VIEW pg_stat_progress_cluster AS
+SELECT
+S.pid AS pid,
+S.datid AS datid,
+D.datname AS datname,
+S.relid AS relid,
+CASE S.param1 WHEN 1 THEN 'CLUSTER'
+  WHEN 2 THEN 'VACUUM FULL'
+  END AS command,
+CASE S.param2 WHEN 0 THEN 'initializing'
+  WHEN 1 THEN 'seq scanning heap'
+  WHEN 2 THEN 'index scanning heap'
+  WHEN 3 THEN 'sorting tuples'
+  WHEN 4 THEN 'writing new heap'
+  WHEN 5 THEN 'swapping relation files'
+  WHEN 6 THEN 'rebuilding index'
+  WHEN 7 THEN 'performing final cleanup'
+  END AS phase,
+S.param3 AS cluster_index_relid,
+S.param4 AS heap_tuples_scanned,
+S.param5 AS heap_blks_total,
+S.param6 AS heap_blks_scanned,
+S.param7 AS index_rebuild_count
+FROM pg_stat_get_progress_info('CLUSTER') AS S
+LEFT JOIN pg_database D ON S.datid = D.oid;
+
 CREATE VIEW pg_user_mappings AS
 SELECT
 U.oid   AS umid,
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 3e2a807640..478894c869 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -36,10 +36,12 @@
 #include "catalog/objectaccess.h"
 #include "catalog/toasting.h"
 #include "commands/cluster.h"
+#include "commands/progress.h"
 #include "commands/tablecmds.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
 #include "optimizer/optimizer.h"
+#include "pgstat.h"
 #include "storage/bufmgr.h"
 #include "storage/lmgr.h"
 #include "storage/predicate.h"
@@ -276,6 +278,8 @@ cluster_rel(Oid tableOid, Oid indexOid, int options)
 	/* Check for user-requested abort. */
 	CHECK_FOR_INTERRUPTS();
 
+	pgstat_progress_start_command(PROGRESS_COMMAND_CLUSTER, tableOid);
+
 	/*
 	 * We grab exclusive access t

Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-18 Thread Peter Geoghegan
On Mon, Mar 18, 2019 at 5:12 PM Peter Geoghegan  wrote:
> Smarter choices on page splits pay off with higher client counts
> because they reduce contention at likely hot points. It's kind of
> crazy that the code in _bt_check_unique() sometimes has to move right,
> while holding an exclusive buffer lock on the original page and a
> shared buffer lock on its sibling page at the same time. It then has
> to hold a third buffer lock concurrently, this time on any heap pages
> it is interested in.

Actually, by the time we get to 16 clients, this workload does make
the indexes and tables smaller. Here is pg_buffercache output
collected after the first 16 client case:

Master
==

 relname               │ relforknumber │ size_main_rel_fork_blocks │ buffer_count │ avg_buffer_usg
───────────────────────┼───────────────┼───────────────────────────┼──────────────┼────────────────────
 pgbench_history       │             0 │                   123,484 │      123,484 │ 4.9989715266755207
 pgbench_accounts      │             0 │                    34,665 │       10,682 │ 4.4948511514697622
 pgbench_accounts_pkey │             0 │                     5,708 │        1,561 │ 4.8731582319026265
 pgbench_tellers       │             0 │                       489 │          489 │ 5.
 pgbench_branches      │             0 │                       284 │          284 │ 5.
 pgbench_tellers_pkey  │             0 │                        56 │           56 │ 5.


Patch
=

 relname               │ relforknumber │ size_main_rel_fork_blocks │ buffer_count │ avg_buffer_usg
───────────────────────┼───────────────┼───────────────────────────┼──────────────┼────────────────────
 pgbench_history       │             0 │                   127,864 │      127,864 │ 4.9980447975974473
 pgbench_accounts      │             0 │                    33,933 │        9,614 │ 4.3517786561264822
 pgbench_accounts_pkey │             0 │                     5,487 │        1,322 │ 4.8857791225416036
 pgbench_tellers       │             0 │                       204 │          204 │ 4.9803921568627451
 pgbench_branches      │             0 │                       198 │          198 │ 4.3535353535353535
 pgbench_tellers_pkey  │             0 │                        14 │           14 │ 5.


The main fork for pgbench_history is larger with the patch, obviously,
but that's good. pgbench_accounts_pkey is about 4% smaller, which is
probably the most interesting observation that can be made here, but
the tables are also smaller. pgbench_accounts itself is ~2% smaller.
pgbench_branches is ~30% smaller, and pgbench_tellers is 60% smaller.
Of course, the smaller tables were already very small, so maybe that
isn't important. I think that this is due to more effective pruning,
possibly because we get better lock arbitration as a consequence of
better splits, and also because duplicates are in heap TID order. I
haven't observed this effect with larger databases, which have been my
focus.

It isn't weird that shared_buffers doesn't have all the
pgbench_accounts blocks, since, of course, this is highly skewed by
design -- most blocks were never accessed from the table.

This effect seems to be robust, at least with this workload. The
second round of benchmarks (which have their own pgbench -i
initialization) show very similar amounts of bloat at the same point.
It may not be that significant, but it's also not a fluke.

-- 
Peter Geoghegan


Re: [HACKERS] CLUSTER command progress monitor

2019-03-18 Thread Tatsuro Yamada

On 2019/03/19 10:43, Tatsuro Yamada wrote:

Hi Rafia!

On 2019/03/18 20:42, Rafia Sabih wrote:

On Fri, 8 Mar 2019 at 09:14, Tatsuro Yamada
 wrote:

Attached file is rebased patch on current HEAD.
I changed a status. :)



Looks like the patch needs a rebase.
I was on the commit fb5806533f9fe0433290d84c9b019399cd69e9c2

PFA reject file in case you want to have a look.


Thanks for testing it. :)
I rebased the patch on the current head: 
f2004f19ed9c9228d3ea2b12379ccb4b9212641f.

Please find attached file.

Also, I share my test case of progress monitor below.



Attached patch is a rebased document patch. :)


Thanks,
Tatsuro Yamada

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index ac2721c8ad..79d98bb601 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -344,6 +344,14 @@ postgres   27093  0.0  0.0  30096  2752 ?Ss   11:34   0:00 postgres: ser
   
  
 
+ 
+  pg_stat_progress_clusterpg_stat_progress_cluster
+  One row for each backend running
+   CLUSTER or VACUUM FULL, showing current progress.
+   See .
+  
+ 
+
 

   
@@ -3394,9 +3402,9 @@ SELECT pg_stat_get_backend_pid(s.backendid) AS pid,
 
   
PostgreSQL has the ability to report the progress of
-   certain commands during command execution.  Currently, the only command
-   which supports progress reporting is VACUUM.  This may be
-   expanded in the future.
+   certain commands during command execution.  Currently, the only commands which 
+   support progress reporting are VACUUM and
+   CLUSTER. This may be expanded in the future.
   
 
  
@@ -3408,9 +3416,10 @@ SELECT pg_stat_get_backend_pid(s.backendid) AS pid,
one row for each backend (including autovacuum worker processes) that is
currently vacuuming.  The tables below describe the information
that will be reported and provide information about how to interpret it.
-   Progress reporting is not currently supported for VACUUM FULL
-   and backends running VACUUM FULL will not be listed in this
-   view.
+   Running VACUUM FULL is listed in pg_stat_progress_cluster
+   because both VACUUM FULL and CLUSTER 
+   rewrite the table, while regular VACUUM only modifies it 
+   in place. See .
   
 
   
@@ -3587,6 +3596,218 @@ SELECT pg_stat_get_backend_pid(s.backendid) AS pid,

   
 
+ 
+
+ 
+  CLUSTER Progress Reporting
+
+  
+   Whenever CLUSTER is running, the
+   pg_stat_progress_cluster view will contain
+   a row for each backend that is currently running CLUSTER or VACUUM FULL. 
+   The tables below describe the information that will be reported and
+   provide information about how to interpret it.
+  
+
+  
+   pg_stat_progress_cluster View
+   
+
+
+  Column
+  Type
+  Description
+ 
+
+
+   
+
+ pid
+ integer
+ Process ID of backend.
+
+
+ datid
+ oid
+ OID of the database to which this backend is connected.
+
+
+ datname
+ name
+ Name of the database to which this backend is connected.
+
+
+ relid
+ oid
+ OID of the table being clustered.
+
+
+ command
+ text
+ 
+   The command that is running. Either CLUSTER or VACUUM FULL.
+ 
+
+
+ phase
+ text
+ 
+   Current processing phase. See  or .
+ 
+
+
+ cluster_index_relid
+ bigint
+ 
+   If the table is being scanned using an index, this is the OID of the
+   index being used; otherwise, it is zero.
+ 
+
+
+ heap_tuples_scanned
+ bigint
+ 
+   Number of heap tuples scanned.
+   This counter only advances when the phase is seq scanning heap, 
+   index scanning heap and writing new heap.
+ 
+
+
+ heap_blks_total
+ bigint
+ 
+   Total number of heap blocks in the table.  This number is reported
+   as of the beginning of seq scanning heap.
+ 
+
+
+ heap_blks_scanned
+ bigint
+ 
+   Number of heap blocks scanned. 
+   This counter only advances when the phase is seq scanning heap.
+ 
+
+
+ index_rebuild_count
+ bigint
+ 
+   Number of indexes rebuilt.
+   This counter only advances when the phase is rebuilding index.
+ 
+
+   
+   
+  
+
+  
+   CLUSTER phases
+   
+
+
+  Phase
+  Description
+ 
+
+
+   
+
+ initializing
+ 
+   CLUSTER is preparing to begin scanning the heap.  This
+   phase is expected to be very brief.
+ 
+
+
+ seq scanning heap
+ 
+   CLUSTER is currently scanning heap from the table by
+   seq scan.
+ 
+
+
+ index scanning heap
+ 
+   CLUSTER is currently scanning heap from the table by
+   index scan.
+ 
+
+
+ sorting tuples
+ 
+   CLUSTER is currently sorting tuples. 
+ 
+
+
+ swapping relation files
+ 
+   CLUSTER is currently

extensions are hitting the ceiling

2019-03-18 Thread Eric Hanson
Hi folks,

After months and years of really trying to make EXTENSIONs meet the
requirements of my machinations, I have come to the conclusion that either
a) I am missing something or b) they are architecturally flawed.  Or
possibly both.

Admittedly, I might be trying to push extensions beyond what the great
elephant in the sky ever intended. The general bent here is to try to
achieve a level of modular reusable components similar to those in
"traditional" programming environments like pip, gem, npm, cpan, etc.
Personally, I am trying to migrate as much of my dev stack as possible away
from the filesystem and into the database. Files, especially code,
configuration, templates, permissions, manifests and other development
files, would be much happier in a database where they have constraints and
an information model and can be queried!

Regardless, it would be really great to be able to install an extension,
and have it cascade down to multiple other extensions, which in turn
cascade down to more, and have everything just work. Clearly, this was
considered in the extension architecture, but I'm running into some
problems making it a reality.  So here they are.


#1: Dependencies

Let's say we have two extensions, A and B, both of which depend on a third
extension C, let's just say C is hstore.  A and B are written by different
developers, and both contain in their .control file the line

requires = 'hstore'

When A is installed, if A creates a schema, it puts hstore in that schema; if
hstore is already installed, it uses it in whatever schema it is already in.
How does the extension know where to reference hstore?

Then, when B is installed, it checks to see if extension hstore is
installed, sees that it is, and moves on.  What if it expects it in a
different place than A does? The hstore extension can only be installed
once, in a single schema, but if multiple extensions depend on it and look
for it in different places, they are incompatible.

I have heard talk of a way to write extensions so that they dynamically
reference the schema of their dependencies, but I sure don't know how that
would work, if it's possible at all.  The @extschema@ variable references the
*current* extension's schema, but there is no dynamic variable to
reference the schema of a dependency.
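
To make that concrete, here is a sketch of what an extension script can and
cannot do today (extension and object names are made up; this assumes
my_ext.control sets relocatable = false, since @extschema@ is only substituted
for non-relocatable extensions, plus requires = 'hstore'):

-- my_ext--1.0.sql
-- @extschema@ is replaced at CREATE EXTENSION time with the schema
-- my_ext itself is installed into:
CREATE TABLE @extschema@.resource (path text PRIMARY KEY, content text);

-- ...but to use the hstore dependency we have to hard-code (or guess via
-- search_path) whatever schema hstore happens to live in:
CREATE FUNCTION @extschema@.resource_kv(p text)
    RETURNS public.hstore
    LANGUAGE sql AS $$ SELECT public.hstore('path', p) $$;
-- there is no dependency-schema substitution analogous to @extschema@.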

Also it is possible in theory to dynamically set search_path to contain
every schema of every dependency in play and then just not specify a schema
when you use something in a dependency.  But this ANDs together all the
scopes of all the dependencies of an extension, introducing potential for
collisions, and is generally kind of clunky.


#2:  Data in Extensions

Extensions that are just a collection of functions and types seem to be the
norm.  Extensions can contain what the docs call "configuration" data, but
rows are really second class citizens:  They aren't tracked with
pg_catalog.pg_depend, they aren't deleted when the extension is dropped,
etc.

Sometimes it would make sense for an extension to contain *only* data, or
insert some rows in a table that the extension doesn't "own", but has as a
dependency.  For example, a "webserver" extension might contain a
"resource" table that serves up the content of resources in the table at a
specified path. But then, another extension, say an app, might want to just
list the webserver extension as a dependency, and insert a few resource
rows into it.  This is really from what I can tell beyond the scope of what
extensions are capable of.
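
As a concrete sketch (names made up), I would like the app extension's script
to be able to simply do the following, with the rows treated as part of the
app extension:

-- inside app--1.0.sql, with requires = 'webserver' in app.control
INSERT INTO webserver.resource (path, content)
VALUES ('/index.html', '<html>hello</html>');
-- today such rows are not tracked in pg_depend and are not removed
-- when the app extension is dropped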


#3 pg_dump and Extensions

Tables created by extensions are skipped by pg_dump unless they are flagged
at create time with:

pg_catalog.pg_extension_config_dump('my_table', 'where id < 20')

However, there's no way that I can tell to mix and match rows and tables
across multiple extensions, so pg_dump can't keep track of multiple
extensions that contain rows in the same table.


I'd like an extension framework that can contain data as first class
citizens, and can gracefully handle a dependency chain and share
dependencies.  I have some ideas for a better approach, but they are pretty
radical.  I thought I would send this out and see what folks think.

Thanks,
Eric
--
http://aquameta.org/


Re: Problem with default partition pruning

2019-03-18 Thread Kyotaro HORIGUCHI
Hello.

At Fri, 15 Mar 2019 17:30:07 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
 wrote in 
<20190315.173007.147577546.horiguchi.kyot...@lab.ntt.co.jp>
> The patch relies on the fact(?) that the lowest index is always
> -1 in range partition and uses it as pseudo default
> partition. I'm not sure it is really the fact and anyway it
> donsn't seem the right thing to do. Could you explain how it
> works, not what you did in this patch?

I understood how it works, but I am still uneasy that only list
partitioning requires scan_default. Anyway, please ignore this.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




RE: [HACKERS] Cached plans and statement generalization

2019-03-18 Thread Yamaji, Ryo

On Tue, Jan 29, 2019 at 10:46 AM, Konstantin Knizhnik wrote:
> Rebased version of the patch is attached.

I'm sorry for the late review.
I confirmed the behavior of autoprepare-12.patch. It is summarized below.

・parameter
The expected behavior was observed for each setting value.
However, I think it would be kinder to describe in the manual that autoprepare
holds an unlimited number of statements when autoprepare_limit (or
autoprepare_memory_limit) is set to 0.
There is no problem with the behavior itself.

・pg_autoprepared_statements
I confirmed that the view can be referenced properly.

・autoprepare cache retention period
I confirmed that autoprepared statements were deleted when a SET statement or
a DDL statement was executed. Although this differs from explicitly prepared
statements, it is acceptable behavior for autoprepare.

・performance
I did not re-measure the basic performance of autoprepare, because the
previous patch (autoprepare-11.patch) already showed no performance problem.
However, because it was argued that performance degradation occurs when
prepared statements are executed against a partitioned table, I expected that
autoprepare might exhibit similar behavior and measured that case. I also
suspected that the plan_cache_mode setting might not apply to autoprepare, so
I measured each plan_cache_mode setting separately.
The results are below (numbers are TPS).

plan_cache_mode      simple   simple(autoprepare)   prepare
auto                 130      121                   121.5
force_custom_plan    132.5    90.7                  122.7
force_generic_plan   126.7    14.7                  24.7

Performance degradation was observed when plan_cache_mode was specified
for autoprepare. Is this behavior correct? I do not know why this is the
result.

The performance test procedure is below:

drop table if exists rt;
create table rt (a int, b int, c int) partition by range (a);
\o /dev/null
select 'create table rt' || x::text || ' partition of rt for values from (' ||
 (x)::text || ') to (' || (x+1)::text || ');' from generate_series(0, 1024) x;
\gexec
\o

pgbench -p port -T 60 -c 1 -n -f test.sql (-M prepared) postgres

test.sql
\set a random (0, 1023)
select * from rt where a = :a;


RE: Libpq support to connect to standby server as priority

2019-03-18 Thread Tsunakawa, Takayuki
From: Haribabu Kommi [mailto:kommi.harib...@gmail.com]
> Target_session_attrs  Target_server_type
> 
> read-write   prefer-slave, slave
> 
> prefer-read master, slave
> read-onlymaster, prefer-slave
> 
> I know that some of the cases above is possible, like master server with
> by default accepts
> read-only sessions. Instead of we put a check to validate what is right
> combination, how
> about allowing the combinations and in case if such combination is not
> possible, means
> there shouldn't be any server the supports the requirement, and connection
> fails.
> 
> comments?

I think that's OK.

To follow the existing naming, it seems better to use "primary" and "standby" 
instead of master and slave -- primary_conninfo, synchronous_standby_names, 
hot_standby, max_standby_streaming_delay and such.


> And also as we need to support the new option to connect to servers < 12
> also, this option
> sends the command "select pg_is_in_recovery()" to the server to find out
> whether the server
> is recovery mode or not?

The query looks good.  OTOH, I think we can return an error when 
target_server_type is specified against older servers, since the parameter is 
new, if supporting older servers would make the libpq code uglier.


> And also regarding the implementation point of view, the new
> target_server_type option
> validation is separately handled, means the check for the required server
> is handled in a separate
> switch case, when both options are given, first find out the required server
> for target_session_attrs
> and validate the same again with target_server_type?

Logically, it seems the order should be reverse; check the server type first, 
then the session attributes, considering other session attributes in the future.


Regards
Takayuki Tsunakawa



Re: Add exclusive backup deprecation notes to documentation

2019-03-18 Thread Stephen Frost
Greetings,

* Peter Eisentraut (peter.eisentr...@2ndquadrant.com) wrote:
> On 2019-03-07 10:33, David Steele wrote:
> > On 3/1/19 3:14 PM, Laurenz Albe wrote:
> >> Magnus Hagander wrote:
> >>> Maybe have the first note say "This method is deprecated because it has 
> >>> serious
> >>> risks (see below)" and then list the actual risks at the end?
> >>
> >> Good idea.  That may attract the attention of the dogs among the readers.
> > 
> > OK, here's a new version that splits the deprecation notes from the 
> > discussion of risks.  I also fixed the indentation.
> 
> The documentation changes appear to continue the theme from the other
> thread that the exclusive backup mode is terrible and everyone should
> feel bad about it.  I don't think there is consensus about that.

I don't view it as up for much debate.  The exclusive backup mode is
quite bad.

> I do welcome a more precise description of the handling of backup_label
> and a better hint in the error message.  I think we haven't gotten to
> the final shape there yet, especially for the latter.  I suggest to
> focus on that.

There isn't a way to handle the backup_label in a sane way when it's
created by the server in the data directory, which is why the
non-exclusive mode explicitly doesn't do that.

> I think it would be helpful to frame the documentation in a way to
> suggest that the nonexclusive mode is more for automation and more
> sophisticated tools and the exclusive mode is more for manual or simple
> scripted use.

I don't agree with this at all, that's not the reason the two exist nor
were they ever developed with the intent that one is for the 'simple'
case and one is for the 'automated' case.  Trying to wedge them into
that framework strikes me as simply trying to sweep the serious issues
under the rug and I don't agree with that- if we are going to continue
to have this, we need to make it clear what the issues are.  Sadly, we
will still have users who don't actually read the docs that carefully
and get bit by the exclusive backup mode because they didn't appreciate
the issues, but we will continue to have that until we finally remove
the exclusive mode.

> If we do think that the exclusive mode will be removed in PG13, then I
> don't think we need further documentation changes.  It already says it's
> deprecated, and we don't need to justify that at length.  But again, I'm
> not convinced that that will happen.

There were at least a few comments made on this thread that it wasn't
made clear enough in the documentation that it's deprecated.  Saying
that we already deprecated it doesn't change that and doesn't do
anything to actually address those concerns.  Continuing to carry
forward these two modes makes further progress in this area difficult
and unlikely to happen, and that's disappointing.

Thanks!

Stephen


signature.asc
Description: PGP signature

