Re: Add "password_protocol" connection parameter to libpq

2019-08-21 Thread Michael Paquier
On Tue, Aug 20, 2019 at 07:09:25PM -0700, Jeff Davis wrote:
> OK, new patch attached. Seems like everyone is in agreement that we
> need a channel_binding param.

+  
+A setting of require means that the connection must
+employ channel binding; and that the client will not respond to
+requests from the server for cleartext password or MD5
+authentication. The default setting is prefer,
+which employs channel binding if available.
+  
This counts for other auth requests as well like krb5, no?  I think
that we should also add the "disable" flavor for symmetry, and
also...

+#define DefaultChannelBinding  "prefer"
If the build of Postgres does not support SSL (USE_SSL is not
defined), I think that DefaultChannelBinding should be "disable".
That would make things consistent with sslmode.
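
For what it's worth, a sketch of that symmetry (fe-connect.c already handles
DefaultSSLMode with the same kind of #ifdef):

    #ifdef USE_SSL
    #define DefaultChannelBinding	"prefer"
    #else
    /* without SSL support there is no channel binding to offer */
    #define DefaultChannelBinding	"disable"
    #endif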

+with PostgreSQL 11.0 or later servers using
Here I would use PostgreSQL 11, only mentioning the major version as
it was also available at beta time.

case AUTH_REQ_OK:
+   if (conn->channel_binding[0] == 'r' && !conn->channel_bound)
+   {
+   printfPQExpBuffer(&conn->errorMessage,
+ libpq_gettext("Channel binding required but not offered by server\n"));
+   return STATUS_ERROR;
+   }
Doing the filtering at the time of AUTH_REQ_OK is necessary for
"trust", but shouldn't this also be done earlier for other request
types?  This could save round-trips to the server if we know that an
exchange begins with a protocol which will never satisfy this
requirement, saving effort for a client doomed to fail after the first
AUTH_REQ received.  That would be the case for all AUTH_REQ types except
the SASL ones and, of course, AUTH_REQ_OK.
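
As a rough sketch of that earlier filtering (hypothetical placement and
wording; it reuses the conn->channel_binding field and the error reporting
from the patch hunk above):

    /*
     * Sketch only: fail fast for authentication requests that can never
     * satisfy channel_binding=require, instead of continuing the exchange.
     * SASL requests are left alone since SCRAM-SHA-256-PLUS may still be
     * negotiated, and AUTH_REQ_OK keeps the final check from the patch.
     */
    switch (areq)
    {
        case AUTH_REQ_MD5:
        case AUTH_REQ_PASSWORD:
        case AUTH_REQ_KRB4:
        case AUTH_REQ_KRB5:
        case AUTH_REQ_GSS:
        case AUTH_REQ_SSPI:
            if (conn->channel_binding[0] == 'r')	/* require */
            {
                printfPQExpBuffer(&conn->errorMessage,
                                  libpq_gettext("channel binding required but not supported by this authentication method\n"));
                return STATUS_ERROR;
            }
            break;
        default:
            break;
    }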

Could you please add negative tests in src/test/authentication/?  What
could be covered there is the case where "prefer" (and "disable"?) is
set, so the authentication is able to go through, and the case where
"require" results in a proper failure as SSL is not used.
Tests in src/test/ssl/ could include:
- Make sure that "require" works properly.
- Test after "disable".

+   if (conn->channel_binding[0] == 'r')
Maybe this should have a comment noting that this means "require", in a
fashion similar to what is done when checking conn->sslmode.
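
Something as small as this would do (sketch; it mirrors how conn->sslmode is
tested by its first byte elsewhere):

    /* 'r' means channel_binding=require, same style as the sslmode checks */
    if (conn->channel_binding[0] == 'r')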
--
Michael




Re: some SCRAM read_any_attr() confusion

2019-08-21 Thread Peter Eisentraut
On 2019-08-17 14:57, Michael Paquier wrote:
> On Sat, Aug 17, 2019 at 10:11:27AM +0200, Peter Eisentraut wrote:
>> I was a bit confused by some of the comments around the SCRAM function
>> read_any_attr(), used to skip over extensions.
>>
>> The comment "Returns NULL if there is attribute.", besides being
>> strangely worded, appears to be wrong anyway, because the function never
>> returns NULL.
> 
> This may have come from an incorrect merge with a past version when
> this code has been developed.  +1 for me for your suggestions and your
> patch.

committed

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: ICU for global collation

2019-08-21 Thread Peter Eisentraut
On 2019-08-21 08:56, Andrey Borodin wrote:
> postgres=# create database a template template0 collation_provider icu lc_collate 'en_US.utf8';
> CREATE DATABASE
> postgres=# \c a
> 2019-08-21 11:43:40.379 +05 [41509] FATAL:  collations with different collate and ctype values are not supported by ICU
> FATAL:  collations with different collate and ctype values are not supported by ICU

Try

create database a template template0 collation_provider icu locale
'en_US.utf8';

which sets both lc_collate and lc_ctype.  But 'en_US.utf8' is not a
valid ICU locale name.  Perhaps use 'en' or 'en-US'.

I'm making a note that we should prevent creating a database with a
faulty locale configuration in the first place instead of failing when
we're connecting.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Fix typos and inconsistencies for HEAD (take 11)

2019-08-21 Thread Michael Paquier
On Tue, Aug 20, 2019 at 04:47:41PM -0400, Alvaro Herrera wrote:
> Yeah, Alexander proposed change is correct.  I just pushed it.

Thanks, Alvaro.
--
Michael




"ago" times on buildfarm status page

2019-08-21 Thread Peter Eisentraut
I find the time displays like

01:03 ago

on the buildfarm status page unhelpful.

First, I can never tell whether this is hours-minutes or minutes-seconds
-- there is probably a less ambiguous format available.

But more importantly, the page doesn't say when it was generated, so a
relative time like this is meaningless.  The page might have most
recently reloaded last night.  That means when I look at the page, I
*always* have to reload it first to make sense of the times.

I notice that the page source actually includes absolute times that are
then converted to relative using some JavaScript.  Could we perhaps just
turn that off?  Or preferably convert to local time.  I can make sense of
an absolute local time much more easily: I can compare it to the clock in
the corner of the screen, and I can compare that, say, to a commit
timestamp.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: "ago" times on buildfarm status page

2019-08-21 Thread Magnus Hagander
On Wed, Aug 21, 2019 at 9:40 AM Peter Eisentraut <
peter.eisentr...@2ndquadrant.com> wrote:

> I find the time displays like
>
> 01:03 ago
>
> on the buildfarm status page unhelpful.
>
> First, I can never tell whether this is hours-minutes or minutes-seconds
> -- there is probably a less ambiguous format available.
>
> But more importantly, the page doesn't say when it was generated, so a
> relative time like this is meaningless.  The page might have most
> recently reloaded last night.  That means when I look at the page, I
> *always* have to reload it first to make sense of the times.
>
> I notice that the page source actually includes absolute times that are
> then converted to relative using some JavaScript.  Could we perhaps just
> turn that off?  Or preferably convert to local time.  I can make sense of
> an absolute local time much more easily: I can compare it to the clock in
> the corner of the screen, and I can compare that, say, to a commit
> timestamp.
>

It used to be that the "ago" part was generated on the server, but Andrew
changed that to the fixed timestamp + javascript to improve cacheability and
thus performance. Perhaps now that it works that way, it could be as easy as
adding a checkbox to the page (which could remember your preference in a
cookie) that switches between the two modes?

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: POC: Cleaning up orphaned files using undo logs

2019-08-21 Thread Dilip Kumar
On Tue, Aug 20, 2019 at 7:57 PM Robert Haas  wrote:
>
> On Mon, Aug 19, 2019 at 2:04 AM Dilip Kumar  wrote:
> > Currently, In UnpackedUndoRecord we store all members directly which
> > are set by the caller.  We store pointers to some header which are
> > allocated internally by the undo layer and the caller need not worry
> > about setting them.  So now you are suggesting to put other headers
> > also as structures in UnpackedUndoRecord.  I as such don't have much
> > problem in doing that but I think initially Robert designed
> > UnpackedUndoRecord structure this way so it will be good if Robert
> > provides his opinion on this.
>
> I don't believe that's what is being suggested.  It seems to me that
> the thing Andres is complaining about here has roots in the original
> sketch that I did for this code.  The oldest version I can find is
> here:
>
> https://github.com/EnterpriseDB/zheap/commit/7d194824a18f0c5e85c92451beab4bc6f044254c
>
> In this version, and I think still in the current version, there is a
> two-stage marshaling strategy.  First, the individual fields from the
> UnpackedUndoRecord get copied into global variables (yes, that was my
> fault, too, at least in part!) which are structures. Then, the
> structures get copied into the target buffer. The idea of that design
> was to keep the code simple, but it didn't really work out, because
> things got a lot more complicated between the time I wrote those 3244
> lines of code and the >3000 lines of code that live in this patch
> today. One thing that changed was that we moved more and more in the
> direction of considering individual fields as separate objects to be
> separately included or excluded, whereas when I wrote that code I
> thought we were going to have groups of related fields that stood or
> fell together. That idea turned out to be wrong. (There is the
> even-larger question here of whether we ought to take Heikki's
> suggestion and make this whole thing a lot more generic, but let's
> start by discussing how the design that we have today could be
> better-implemented.)
>
> If I understand Andres correctly, he's arguing that we ought to get
> rid of the two-stage marshaling strategy.  During decoding, he wants
> data to go directly from the buffer that contains it to the
> UnpackedUndoRecord without ever being stored in the UnpackUndoContext.
> During insertion, he wants data to go directly from the
> UnpackedUndoRecord to the buffer that contains it.  Or, at least, if
> there has to be an intermediate format, he wants it to be just a chunk
> of raw bytes, rather than a bunch of individual fields like we have in
> UndoPackContext currently.  I think that's a reasonable goal.  I'm not
> as concerned about it as he is from a performance point of view, but I
> think it would make the code look nicer, and that would be good.  If
> we save CPU cycles along the way, that is also good.
>
> In broad outline, what this means is:
>
> 1. Any field in the UndoPackContext that starts with urec_ goes away.
> 2. Instead of something like InsertUndoBytes((char *)
> &(ucontext->urec_fxid), ...) we'd write InsertUndoBytes((char *)
> &uur->uur_fxid, ...).
> 3. Similarly instead of ReadUndoBytes((char *) &ucontext->urec_fxid,
> ...) we'd write ReadUndoBytes((char *) &uur->uur_fxid, ...).
> 4. It seems slightly trickier to handle the cases where we've got a
> structure instead of individual fields, like urec_hd.  But those could
> be broken down into field-by-field reads and writes, e.g. in this case
> one call for urec_type and a second for urec_info.
> 5. For uur_group and uur_logswitch, the code would need to allocate
> those subsidiary structures before copying into them.
>
> To me, that seems like it ought to be a pretty straightforward change
> that just makes things simpler.  We'd probably need to pass the
> UnpackedUndoRecord to BeginUnpackUndo instead of FinishUnpackUndo, and
> keep a pointer to it in the UnpackUndoContext, but that seems fine.
> FinishUnpackUndo would end up just about empty, maybe entirely empty.
>
> Is that a reasonable idea?
>
I have already attempted that part, and I feel it is not making the code
any simpler than what we have today.  For packing, it's fine because I
can process all the members at once and directly pack them into one memory
chunk, and I can insert that into the buffer with one call of
InsertUndoBytes, which will make the code simpler.

But, while unpacking, if I directly unpack into the UnpackedUndoRecord,
there are a few complexities.  I am not saying those are difficult to
implement, but the code may not look better.

a) First, we need to add extra stages for unpacking, as we need to do it
field by field.

b) Some of the members, like uur_payload and uur_tuple, are not the same
type in the UnpackedUndoRecord compared to how they are stored on the page.
In UnpackedUndoRecord those are StringInfoData, whereas on the page we store
them as an UndoRecordPayload header followed by the actual data.  I am not
saying we can not unpack this directly; we can do it by first reading the
payload length from the page into uur_payload.len, then the tuple length
into uur_tuple.len, and then both chunks of data.  And, for that, we will
have to add extra stages.

Re: Global temporary tables

2019-08-21 Thread Konstantin Knizhnik



On 20.08.2019 20:01, Pavel Stehule wrote:
Another solution is to wait for ZHeap storage, and then a replica can have
its own UNDO log.





I thought about implementation of special table access method for
temporary tables.


+1

Unfortunately, implementing a special table access method for temporary
tables doesn't solve all problems.

XID generation is not part of table access methods.
So we still need to assign some XID to a write transaction at the replica
which will not conflict with XIDs received from the master.
Actually, only global temp tables can be updated at a replica, so the
assigned XIDs can be stored only in tuples of such relations.
But I am still not sure that we can use an arbitrary XID for such
transactions at the replica.


Also, I am concerned about the amount of functionality which has to be
reimplemented for global temp tables if we really want to provide an access
method for them:


1. CLOG
2. vacuum
3. MVCC visibility

And still it is not possible to encapsulate all the changes needed to support
writes to temp tables at a replica inside a table access method:
XID assignment, transaction commit and abort, subtransactions - all these
places need to be patched.




--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Optimization of vacuum for logical replication

2019-08-21 Thread Konstantin Knizhnik

Hi, hackers.

Right now, if wal_level is greater than or equal to "replica", vacuum of a
relation copies all its data to the WAL:



    /*
     * We need to log the copied data in WAL iff WAL archiving/streaming is
     * enabled AND it's a WAL-logged rel.
     */
    use_wal = XLogIsNeeded() && RelationNeedsWAL(NewHeap);

Obviously we have to do it for physical replication and WAL archiving.
But why do we need to do such an expensive operation (actually copying all
table data three times) if we use logical replication?
Logically, vacuum doesn't change the relation, so there is no need to write
any data to the log and have the WAL sender process it.


I wonder if we can check that

1. wal_level is "logical"
2. There are no physical replication slots
3. WAL archiving is disabled

and in this case not write the cloned relation to the WAL?
A small patch implementing such behavior is attached to this mail.
It allows significantly reducing the WAL size when performing vacuum in
multimaster, which uses logical replication between cluster nodes.


What can be wrong with such an optimization?

--

Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index f1ff01e..1d0a957 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -36,6 +36,7 @@
 #include "commands/progress.h"
 #include "executor/executor.h"
 #include "pgstat.h"
+#include "replication/slot.h"
 #include "storage/bufmgr.h"
 #include "storage/bufpage.h"
 #include "storage/bufmgr.h"
@@ -719,7 +720,8 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 	 * We need to log the copied data in WAL iff WAL archiving/streaming is
 	 * enabled AND it's a WAL-logged rel.
 	 */
-	use_wal = XLogIsNeeded() && RelationNeedsWAL(NewHeap);
+	use_wal = XLogIsNeeded() && RelationNeedsWAL(NewHeap)
+		&& (wal_level != WAL_LEVEL_LOGICAL || ReplicationSlotsCountPhysicalSlots() != 0 || XLogArchiveMode != ARCHIVE_MODE_OFF);
 
 	/* use_wal off requires smgr_targblock be initially invalid */
 	Assert(RelationGetTargetBlock(NewHeap) == InvalidBlockNumber);
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 62342a6..7961056 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -829,6 +829,27 @@ ReplicationSlotsComputeLogicalRestartLSN(void)
 }
 
 /*
+ * Count physical replication slots
+ */
+int
+ReplicationSlotsCountPhysicalSlots(void)
+{
+	int			i;
+	int			n_slots = 0;
+
+	LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
+	for (i = 0; i < max_replication_slots; i++)
+	{
+		ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i];
+		n_slots += !SlotIsLogical(s);
+	}
+	LWLockRelease(ReplicationSlotControlLock);
+
+	return n_slots;
+}
+
+
+/*
  * ReplicationSlotsCountDBSlots -- count the number of slots that refer to the
  * passed database oid.
  *
diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h
index 8fbddea..50ca192 100644
--- a/src/include/replication/slot.h
+++ b/src/include/replication/slot.h
@@ -199,6 +199,7 @@ extern void ReplicationSlotsComputeRequiredLSN(void);
 extern XLogRecPtr ReplicationSlotsComputeLogicalRestartLSN(void);
 extern bool ReplicationSlotsCountDBSlots(Oid dboid, int *nslots, int *nactive);
 extern void ReplicationSlotsDropDBSlots(Oid dboid);
+extern int  ReplicationSlotsCountPhysicalSlots(void);
 
 extern void StartupReplicationSlots(void);
 extern void CheckPointReplicationSlots(void);


Re: Optimization of vacuum for logical replication

2019-08-21 Thread Bernd Helmle
Am Mittwoch, den 21.08.2019, 12:20 +0300 schrieb Konstantin Knizhnik:
> I wonder if we can check that
> 
> 1. wal_level is "logical"
> 2. There are no physical replication slots
> 3. WAL archiving is disabled

Not sure I get that correctly; I can still have a physical standby
without replication slots connected to such an instance. How would your
idea handle this situation?

Bernd






Re: Remove one last occurrence of "replication slave" in comments

2019-08-21 Thread Peter Eisentraut
On 2019-06-19 19:04, Dagfinn Ilmari Mannsåker wrote:
> There were some more master/slave references in the plpgsql foreign key
> tests, which the attached changes to base/leaf instead.

base/leaf doesn't sound like a good pair.  I committed it with root/leaf
instead.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Optimization of vacuum for logical replication

2019-08-21 Thread Konstantin Knizhnik




On 21.08.2019 12:34, Bernd Helmle wrote:

Am Mittwoch, den 21.08.2019, 12:20 +0300 schrieb Konstantin Knizhnik:

I wonder if we can check that

1. wal_level is "logical"
2. There are no physical replication slots
3. WAL archiving is disabled

Not sure i get that correctly, i can still have a physical standby
without replication slots connected to such an instance. How would your
idea handle this situation?


Yes, it is possible to have a physical replica without a replication slot.
But it is not safe, because there is always a risk that the lag between
master and replica becomes larger than the amount of WAL kept at the master.
Also, I can't believe that a DBA who explicitly sets wal_level to
logical will use streaming replication without an associated replication slot.


And certainly it is possible to add a GUC which controls such an optimization.

--

Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: POC: Cleaning up orphaned files using undo logs

2019-08-21 Thread Amit Kapila
On Tue, Aug 20, 2019 at 8:10 PM Robert Haas  wrote:
>
> On Tue, Aug 20, 2019 at 2:42 AM Amit Kapila  wrote:
> > > Well, my main point, which so far has largely been ignored, was that we
> > > may not acquire page locks when we still need to search for victim
> > > buffers later. If we don't need to lock the pages up-front, but only do
> > > so once we're actually copying the records into the undo pages, then we
> > > don't need a separate phase to acquire the locks. We can still hold all of
> > > the page locks at the same time, as long as we just acquire them at the
> > > later stage.
> >
> > Okay, IIUC, this means that we should have a separate phase where we
> > call LockUndoBuffers (or something like that) before
> > InsertPreparedUndo and after PrepareUndoInsert.  The LockUndoBuffers
> > will lock all the buffers pinned during PrepareUndoInsert.  We can
> > probably call LockUndoBuffers before entering the critical section to
> > avoid any kind of failure in critical section.  If so, that sounds
> > reasonable to me.
>
> I'm kind of scratching my head here, because this is clearly different
> than what Andres said in the quoted text to which you were replying.
> He clearly implied that we should acquire the buffer locks within the
> critical section during InsertPreparedUndo, and you responded by
> proposing to do it outside the critical section in a separate step.
> Regardless of which way is actually better, when somebody says "hey,
> let's do A!" and you respond by saying "sounds good, I'll go implement
> B!" that's not really helping us to get toward a solution.
>

I got confused by the statement "We can still hold all of the page
locks at the same time, as long as we just acquire them at the later
stage."

> FWIW, although I also thought of doing what you are describing here, I
> think Andres's proposal is probably preferable, because it's simpler.
> There's not really any reason why we can't take the buffer locks from
> within the critical section, and that way callers don't have to deal
> with the extra step.
>

IIRC, the reason this was done before starting the critical section was
the coding convention mentioned in src/access/transam/README
(Section: Write-Ahead Log Coding).  It says to first pin and exclusively
lock the shared buffers and then start the critical section.  It might be
that we can bypass that convention here, but I guess it is mainly to
avoid any error inside the critical section.  I have checked the
LWLockAcquire path and there doesn't seem to be any reason it would
throw an error, except when the caller has acquired many locks at the
same time, which is not the case here.
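
For reference, a generic sketch of the convention being cited (the typical
shape of a WAL-logged buffer change; not code from the undo patch, and the
rmgr id and info byte are placeholders):

    Buffer		buffer;
    XLogRecPtr	recptr;

    /* 1. Pin and exclusively lock the shared buffer first ... */
    buffer = ReadBuffer(relation, blkno);
    LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);

    /* 2. ... then start the critical section and apply the change. */
    START_CRIT_SECTION();
    /* modify the page here */
    MarkBufferDirty(buffer);

    /* 3. WAL-log it while still inside the critical section. */
    XLogBeginInsert();
    XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
    recptr = XLogInsert(rmid, info);
    PageSetLSN(BufferGetPage(buffer), recptr);

    END_CRIT_SECTION();
    UnlockReleaseBuffer(buffer);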


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: Optimization of vacuum for logical replication

2019-08-21 Thread Sergei Kornilov
Hello

> Also, I can't believe that a DBA who explicitly sets wal_level to
> logical will use streaming replication without an associated replication slot.

I am.

> Yes, it is possible to have a physical replica without a replication slot.
> But it is not safe, because there is always a risk that the lag between
> master and replica becomes larger than the amount of WAL kept at the master.

Just an example: a replica for manual queries, QA purposes, or something else
that is not an important part of the system.
If I use replication slots, my risk is running out of space on the primary and
therefore a shutdown of the primary, with downtime for the application.
If I use wal_keep_segments instead, I have a limited (and usually stable)
amount of WAL but risk having an outdated replica.

I prefer to have an outdated replica while the primary stays safe. It's OK for
me to just take a fresh pg_basebackup from another replica.
And the application wants to use logical replication, so wal_level = logical.

If we do not want to support such a use case, we need to explicitly forbid
replication without replication slots.

regards, Sergei




Re: Optimization of vacuum for logical replication

2019-08-21 Thread Bernd Helmle
Am Mittwoch, den 21.08.2019, 13:26 +0300 schrieb Konstantin Knizhnik:
> Yes, it is possible to have a physical replica without a replication slot.
> But it is not safe, because there is always a risk that the lag between
> master and replica becomes larger than the amount of WAL kept at the master.

Sure, but that doesn't mean use cases for this aren't real.

> Also, I can't believe that a DBA who explicitly sets wal_level to
> logical will use streaming replication without an associated replication slot.

Well, I know people doing exactly this, for various reasons (short-lived
replicas, logically replicated table sets for reports, ...). The fact that
they can have loosely coupled replicas with either physical or logical
replication is a feature they'd really miss.
Bernd





Re: "ago" times on buildfarm status page

2019-08-21 Thread Dagfinn Ilmari Mannsåker
Magnus Hagander  writes:

> On Wed, Aug 21, 2019 at 9:40 AM Peter Eisentraut <
> peter.eisentr...@2ndquadrant.com> wrote:
>
>> I find the time displays like
>>
>> 01:03 ago
>>
>> on the buildfarm status page unhelpful.
>>
>> First, I can never tell whether this is hours-minutes or minutes-seconds
>> -- there is probably a less ambiguous format available.
>>
>> But more importantly, the page doesn't say when it was generated, so a
>> relative time like this is meaningless.  The page might have most
>> recently reloaded last night.  That means when I look at the page, I
>> *always* have to reload it first to make sense of the times.
>>
>> I notice that the page source actually includes absolute times that are
>> then converted to relative using some JavaScript.  Could we perhaps just
>> turn that off?  Or preferably convert to local time.  I can make sense of
>> an absolute local time much more easily: I can compare it to the clock in
>> the corner of the screen, and I can compare that, say, to a commit
>> timestamp.
>>
>
> It used to be that the "ago" part was generated on the server, but Andrew
> changed that to the fixed timestamp + javascript to improve cacheability and
> thus performance. Perhaps now that it works that way, it could be as easy as
> adding a checkbox to the page (which could remember your preference in a
> cookie) that switches between the two modes?

The Javascript could also be made to update the "ago" part every minute,
and show the absolute time as a tooltip, which is what pretty much every
other website does.

- ilmari
-- 
"A disappointingly low fraction of the human race is,
 at any given time, on fire." - Stig Sandbeck Mathisen




when the IndexScan reset to the next ScanKey for in operator

2019-08-21 Thread Alex
given the following example:
postgres=# create table t2 as select generate_series(1, 10) as a,
 generate_series(1, 10) as b;
SELECT 10
postgres=# create index t2_idx on t2(a);
CREATE INDEX
postgres=# set enable_seqscan = 0;
SET
postgres=# select * from t2 where a in (1, 10);
  a |  b
----+----
  1 |  1
 10 | 10
(2 rows)


I can see that the plan stores the "1 and 10" information in
IndexScan->indexqual, which is a ScalarArrayOpExpr expression.

I suppose the executor should scan for 1 first; once all the tuples for 1
have been scanned, then **it should be reset to 10** and scan again.
However, I can't find the code for that.  It looks like index_rescan is not
for this.  Am I missing something?

thanks


WIP/PoC for parallel backup

2019-08-21 Thread Asif Rehman
Hi Hackers,

I have been looking into adding a parallel backup feature to pg_basebackup.
Currently pg_basebackup sends the BASE_BACKUP command to take a full backup;
the server scans PGDATA and sends the files to pg_basebackup. In general,
the server takes the following steps on a BASE_BACKUP command:

- do pg_start_backup
- scan PGDATA, create and send a header containing information about
tablespaces
- send each tablespace to pg_basebackup
- and then do pg_stop_backup

All these steps are executed sequentially by a single process. The idea I
am working on is to separate these steps into multiple commands in the
replication grammar, and to add worker processes to pg_basebackup so that
they can copy the contents of PGDATA in parallel.

The command line interface syntax would be like:
pg_basebackup --jobs=WORKERS


Replication commands:

- BASE_BACKUP [PARALLEL] - returns a list of files in PGDATA.
If the parallel option is given, it will only do pg_start_backup, scan
PGDATA and send a list of file names.

- SEND_FILES_CONTENTS (file1, file2, ...) - returns the files in the given
list. pg_basebackup will send back a list of filenames in this command. This
command will be sent by each worker, and that worker will receive the
listed files.

- STOP_BACKUP
when all workers finish, pg_basebackup will send the STOP_BACKUP command.

pg_basebackup can start by sending the "BASE_BACKUP PARALLEL" command and
getting a list of filenames from the server in response. It should then
divide this list according to the --jobs parameter (this division can be
based on file sizes). Each of the worker processes will issue a
SEND_FILES_CONTENTS (file1, file2, ...) command; in response, the server will
send the files mentioned in the list back to the requesting worker process.

Once all the files are copied, pg_basebackup will send the STOP_BACKUP
command. A similar idea has been discussed by Robert on the incremental
backup thread a while ago. This is similar to that, but instead of
START_BACKUP and SEND_FILE_LIST, I have combined them into BASE_BACKUP
PARALLEL.

I have done a basic proof of concept (POC), which is also attached. I
would appreciate some input on this. So far, I am simply dividing the list
equally and assigning the pieces to worker processes. I intend to fine-tune
this by taking file sizes into consideration. Further, to add tar format
support, I am considering that each worker process handles all files
belonging to a tablespace in its list (i.e. creates and copies a tar file)
before it processes the next tablespace. As a result, this will create tar
files that are disjoint with respect to tablespace data. For example:

Say tablespace t1 has 20 files, tablespace t2 has 10, and we have 5 worker
processes. Ignoring all other factors for the sake of this example, each
worker process will get a group of 4 files of t1 and 2 files of t2. Each
process will create 2 tar files, one for t1 containing 4 files and another
for t2 containing 2 files.
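
To make that division concrete, here is a toy, self-contained sketch (names
made up, not from the attached POC) that deals the worked example above out
round-robin - 20 files in t1 and 10 in t2 across 5 workers - so each worker
ends up with 4 files of t1 and 2 of t2:

    #include <stdio.h>

    #define NUM_WORKERS 5

    int
    main(void)
    {
        /* per-tablespace file counts from the example above */
        const char *tablespaces[] = {"t1", "t2"};
        int         nfiles[] = {20, 10};
        int         assigned[NUM_WORKERS][2] = {{0}};

        /* deal each tablespace's files out round-robin among the workers */
        for (int ts = 0; ts < 2; ts++)
            for (int f = 0; f < nfiles[ts]; f++)
                assigned[f % NUM_WORKERS][ts]++;

        /* each worker would then build one tar file per tablespace group */
        for (int w = 0; w < NUM_WORKERS; w++)
            printf("worker %d: %d files of %s, %d files of %s\n",
                   w, assigned[w][0], tablespaces[0],
                   assigned[w][1], tablespaces[1]);
        return 0;
    }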


Regards,
Asif


0001-Initial-POC-on-parallel-backup.patch
Description: Binary data


Re: Remove one last occurrence of "replication slave" in comments

2019-08-21 Thread Dagfinn Ilmari Mannsåker
Peter Eisentraut  writes:

> On 2019-06-19 19:04, Dagfinn Ilmari Mannsåker wrote:
>> There were some more master/slave references in the plpgsql foreign key
>> tests, which the attached changes to base/leaf instead.
>
> base/leaf doesn't sound like a good pair.  I committed it with root/leaf
> instead.

Thanks!  You're right, that is a better name pair.

- ilmari
-- 
"A disappointingly low fraction of the human race is,
 at any given time, on fire." - Stig Sandbeck Mathisen




Re: "ago" times on buildfarm status page

2019-08-21 Thread Tom Lane
Magnus Hagander  writes:
> On Wed, Aug 21, 2019 at 9:40 AM Peter Eisentraut <
> peter.eisentr...@2ndquadrant.com> wrote:
>> I find the time displays like
>> 01:03 ago
>> on the buildfarm status page unhelpful.
>> 
>> I notice that the page source actually includes absolute times that are
>> then converted to relative using some JavaScript.  Could we perhaps just
>> turn that off?

> It used to be that the "ago" part was generated on the server, but Andrew
> changed that to the fixed timestamp + javascript to improve cacheability and
> thus performance. Perhaps now that it works that way, it could be as easy as
> adding a checkbox to the page (which could remember your preference in a
> cookie) that switches between the two modes?

FWIW, I'm used to the way it's shown and would not like a change, so
if this can be made user-settable as Magnus suggests, that would be
better IMO.

The real problem with that column though is that it relies on run start
times that are self-reported by the buildfarm clients, and some of them
have system clocks that are many hours off reality.  What *I'd* like to
see is for the column to contain time of receipt of the buildfarm report
at the server, less the measured runtime of the test.

regards, tom lane




Re: [HACKERS] WAL logging problem in 9.4.3?

2019-08-21 Thread Kyotaro Horiguchi
Hello. New version is attached.

At Mon, 19 Aug 2019 18:59:59 +0900 (Tokyo Standard Time), Kyotaro Horiguchi 
 wrote in 
<20190819.185959.118543656.horikyota@gmail.com>
> Thank you for taking time.
> 
> At Sat, 17 Aug 2019 20:52:30 -0700, Noah Misch  wrote in 
> <20190818035230.gb3021...@rfd.leadboat.com>
> > For two-phase commit, PrepareTransaction() needs to execute pending syncs.
 
Now TwoPhaseFileHeader has two new members for pending syncs. They
are useless during WAL replay, but they are needed for commit-prepared.

> > On Thu, Jul 25, 2019 at 10:39:36AM +0900, Kyotaro Horiguchi wrote:
> > > --- a/src/backend/access/heap/heapam_handler.c
> > > +++ b/src/backend/access/heap/heapam_handler.c
> > > @@ -715,12 +702,6 @@ heapam_relation_copy_for_cluster(Relation OldHeap, 
> > > Relation NewHeap, 
> ...
> > > - use_wal = XLogIsNeeded() && RelationNeedsWAL(NewHeap);
> > > -
> > >   /* use_wal off requires smgr_targblock be initially invalid */
> > >   Assert(RelationGetTargetBlock(NewHeap) == InvalidBlockNumber);
> > 
> > Since you're deleting the use_wal variable, update that last comment.

Oops! Rewrote it.

> > > --- a/src/backend/catalog/storage.c
> > > +++ b/src/backend/catalog/storage.c
> > > @@ -428,21 +450,34 @@ smgrDoPendingDeletes(bool isCommit)
> ...
> > > + smgrimmedsync(srel, MAIN_FORKNUM);
> > 
> > This should sync all forks, not just MAIN_FORKNUM.  Code that writes WAL for
> > FSM_FORKNUM and VISIBILITYMAP_FORKNUM checks RelationNeedsWAL().  There may 
> > be
> > no bug today, but it's conceptually wrong to make RelationNeedsWAL() return
> > false due to this code, use RelationNeedsWAL() for multiple forks, and then
> > not actually sync all forks.
> 
> I agree that all forks needs syncing, but FSM and VM are checking
> RelationNeedsWAL(modified). To make sure, are you suggesting to
> sync all forks instead of emitting WAL for them, or suggesting
> that VM and FSM to emit WALs even when the modified
> RelationNeedsWAL returns false (+ sync all forks)?

In the attached version 19, all forks are synced and no WAL is emitted
for them (as before). FSM and VM are not changed.

> > The https://postgr.es/m/559fa0ba.3080...@iki.fi design had another component
> > not appearing here.  It said, "Instead, at COMMIT, we'd fsync() the 
> > relation,
> > or if it's smaller than some threshold, WAL-log the contents of the whole 
> > file
> > at that point."  Please write the part to WAL-log the contents of small 
> > files
> > instead of syncing them.
> 
> I'm not sure the point of the behavior. I suppose that the "log"
> is a sequence of new_page records. It also needs to be synced and
> it is always larger than the file to be synced. I can't think of
> an appropriate threshold without the point.

This is not included in this version. I'll continue to consider
this.

> > > --- a/src/backend/commands/copy.c
> > > +++ b/src/backend/commands/copy.c
> > > @@ -2725,28 +2722,9 @@ CopyFrom(CopyState cstate)
> > >* If it does commit, we'll have done the table_finish_bulk_insert() at
> > >* the bottom of this routine first.
> > >*
> > > -  * As mentioned in comments in utils/rel.h, the in-same-transaction test
> > > -  * is not always set correctly, since in rare cases 
> > > rd_newRelfilenodeSubid
> > > -  * can be cleared before the end of the transaction. The exact case is
> > > -  * when a relation sets a new relfilenode twice in same transaction, yet
> > > -  * the second one fails in an aborted subtransaction, e.g.
> > > -  *
> > > -  * BEGIN;
> > > -  * TRUNCATE t;
> > > -  * SAVEPOINT save;
> > > -  * TRUNCATE t;
> > > -  * ROLLBACK TO save;
> > > -  * COPY ...
> > 
> > The comment material being deleted is still correct, so don't delete it.

The code is changed to use rd_firstRelfilenodeSubid instead of
rd_newRelfilenodeSubid, which has the issue mentioned in the
deleted section.  So the comment is right but irrelevant to the code
here.  The same thing is written in the comment in RelationData.

(In short, not reverted.)

> > Moreover, the code managing rd_firstRelfilenodeSubid has a similar bug.  The
> > attached patch adds an assertion that RelationNeedsWAL() and the
> > pendingDeletes array have the same opinion about the relfilenode, and it
> > expands a test case to fail that assertion.
..
> > In general, to add a field like this, run "git grep -n 'rd_.*Subid'" and 
> > audit
> > all the lines printed.  Many bits of code need to look at all three,
> > e.g. RelationClose().

I forgot to maintain rd_firstRelfilenodeSubid in many places, and the
assertion failure no longer happens after I fixed that.  Contrary to
my previous mail, useless pending entries are of course removed
at subtransaction abort, and no needless syncs happen in that
sense.  But another type of useless sync was seen with the
previous version 18.

(In short, fixed.)


> >  This field needs to be 100% reliable.  In other words,
> > it must equal InvalidSubTransactionId if and only if the relfilenode matches
> > the relfilen

Re: "ago" times on buildfarm status page

2019-08-21 Thread Andrew Dunstan


On 8/21/19 8:32 AM, Dagfinn Ilmari Mannsåker wrote:
> Magnus Hagander  writes:
>
>> On Wed, Aug 21, 2019 at 9:40 AM Peter Eisentraut <
>> peter.eisentr...@2ndquadrant.com> wrote:
>>
>>> I find the time displays like
>>>
>>> 01:03 ago
>>>
>>> on the buildfarm status page unhelpful.
>>>
>>> First, I can never tell whether this is hours-minutes or minutes-seconds
>>> -- there is probably a less ambiguous format available.
>>>
>>> But more importantly, the page doesn't say when it was generated, so a
>>> relative time like this is meaningless.  The page might have most
>>> recently reloaded last night.  That means when I look at the page, I
>>> *always* have to reload it first to make sense of the times.
>>>
>>> I notice that the page source actually includes absolute times that are
>>> then converted to relative using some JavaScript.  Could we perhaps just
>>> turn that off?  Or preferably convert to local time.  I can make sense of
>>> an absolute local time much more easily: I can compare it to the clock in
>>> the corner of the screen, and I can compare that, say, to a commit
>>> timestamp.
>>>
>> It used to be that the "ago" part was generated on the server, but Andrew
>> changed that to the fixed timestamp + javascript to improve cacheability and
>> thus performance. Perhaps now that it works that way, it could be as easy as
>> adding a checkbox to the page (which could remember your preference in a
>> cookie) that switches between the two modes?
> The Javascript could also be made to update the "ago" part every minute,
> and show the absolute time as a tooltip, which is what pretty much every
> other website does.
>


The code for the page is here:



Patches welcome.


cheers


andrew


-- 
Andrew Dunstan  https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: Optimization of vacuum for logical replication

2019-08-21 Thread Konstantin Knizhnik




On 21.08.2019 14:45, Bernd Helmle wrote:

Am Mittwoch, den 21.08.2019, 13:26 +0300 schrieb Konstantin Knizhnik:

Yes, it is possible to have a physical replica without a replication slot.
But it is not safe, because there is always a risk that the lag between
master and replica becomes larger than the amount of WAL kept at the master.

Sure, but that doesn't mean use cases for this aren't real.


Also, I can't believe that a DBA who explicitly sets wal_level to
logical will use streaming replication without an associated replication slot.

Well, i know people doing exactly this, for various reasons (short
living replicas, logical replicated table sets for reports, ...). The
fact that they can have loosely coupled replicas with either physical
or logical replication is a feature they'd really miss

Bernd



OK, you have convinced me that there are cases where people want to combine
logical replication with streaming replication without a slot.
But would it be acceptable to have a GUC variable (disabled by default) which
allows this optimization to be used?
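
If that route is taken, the GUC itself is just a boolean entry in guc.c,
along these lines (sketch only, with a made-up name):

    /* somewhere in a header / .c file */
    bool		wal_skip_copy_for_logical = false;

    /* entry for guc.c's ConfigureNamesBool[] */
    {
        {"wal_skip_copy_for_logical", PGC_SUSET, WAL_SETTINGS,
            gettext_noop("Skips WAL-logging of relation copies when only "
                         "logical replication consumes the WAL."),
            NULL
        },
        &wal_skip_copy_for_logical,
        false,
        NULL, NULL, NULL
    },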


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: "ago" times on buildfarm status page

2019-08-21 Thread Andrew Dunstan


On 8/21/19 9:55 AM, Tom Lane wrote:
>
> The real problem with that column though is that it relies on run start
> times that are self-reported by the buildfarm clients, and some of them
> have system clocks that are many hours off reality.  What *I'd* like to
> see is for the column to contain time of receipt of the buildfarm report
> at the server, less the measured runtime of the test.


That might be possible. I'll put it on my list of things to do. It's not
happening any time soon, though.


cheers


andrew



-- 
Andrew Dunstan  https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: "ago" times on buildfarm status page

2019-08-21 Thread Andrew Dunstan


On 8/21/19 3:40 AM, Peter Eisentraut wrote:
> I find the time displays like
>
> 01:03 ago
>
> on the buildfarm status page unhelpful.
>
> First, I can never tell whether this is hours-minutes or minutes-seconds
> -- there is probably a less ambiguous format available.



This is hours:minutes. Days are always explicitly prepended if greater
than 0.


>
> But more importantly, the page doesn't say when it was generated, so a
> relative time like this is meaningless.  The page might have most
> recently reloaded last night.  That means when I look at the page, I
> *always* have to reload it first to make sense of the times.
>
> I notice that the page source actually includes absolute times that are
> then converted to relative using some JavaScript.  Could we perhaps just
> turn that off?  Or preferably convert to local time.  I can much easier
> make sense of an absolute local time: I can compare that to the clock in
> the corner of the screen, and I can compare that, say, to a commit
> timestamp.
>


Maybe.


TBH the whole buildfarm UI is rather dated and clunky. But it needs work
from a web monkey, which I am really not.


cheers


andrew


-- 
Andrew Dunstan  https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: [HACKERS] [PATCH] pageinspect function to decode infomasks

2019-08-21 Thread Alvaro Herrera
Can I interest someone into updating this patch?  We now have (I think)
an agreed design, and I think the development work needed should be
straightforward.  We also already have the popcount stuff, so that's a
few lines to be removed from the patch ...

-- 
Álvaro Herrera  https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: POC: Cleaning up orphaned files using undo logs

2019-08-21 Thread Robert Haas
On Wed, Aug 21, 2019 at 6:38 AM Amit Kapila  wrote:
> > FWIW, although I also thought of doing what you are describing here, I
> > think Andres's proposal is probably preferable, because it's simpler.
> > There's not really any reason why we can't take the buffer locks from
> > within the critical section, and that way callers don't have to deal
> > with the extra step.
>
> IIRC, the reason this was done before starting critical section was
> because of coding convention mentioned in src/access/transam/README
> (Section: Write-Ahead Log Coding).   It says first pin and exclusive
> lock the shared buffers and then start critical section.  It might be
> that we can bypass that convention here, but I guess it is mainly to
> avoid any error in the critical section.  I have checked the
> LWLockAcquire path and there doesn't seem to be any reason that it
> will throw error except when the caller has acquired many locks at the
> same time which is not the case here.

Yeah, I think it's fine to deviate from that convention in this
respect.  We treat LWLockAcquire() as a no-fail operation in many
places; in my opinion, that elog(ERROR) that we have for too many
LWLocks should be changed to elog(PANIC) precisely because we do treat
LWLockAcquire() as no-fail in lots of places in the code, but I think
I suggested that once and got slapped down, and I haven't had the
energy to fight about it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] [PATCH] pageinspect function to decode infomasks

2019-08-21 Thread Masahiko Sawada
On Thu, Aug 22, 2019 at 12:19 AM Alvaro Herrera
 wrote:
>
> Can I interest someone into updating this patch?  We now have (I think)
> an agreed design, and I think the development work needed should be
> straightforward.  We also already have the popcount stuff, so that's a
> few lines to be removed from the patch ...
>

I will update the patch and register it for the next Commit Fest tomorrow
if nobody else is interested.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center




Re: POC: Cleaning up orphaned files using undo logs

2019-08-21 Thread Robert Haas
On Wed, Aug 21, 2019 at 3:55 AM Dilip Kumar  wrote:
> I have already attempted that part and I feel it is not making code
> any simpler than what we have today.  For packing, it's fine because I
> can process all the member once and directly pack it into one memory
> chunk and I can insert that to the buffer by one call of
> InsertUndoBytes and that will make the code simpler.

OK...

> But, while unpacking if I directly unpack to the UnpackUndoRecord then
> there are few complexities.  I am not saying those are difficult to
> implement but code may not look better.
>
> a) First, we need to add extra stages for unpacking as we need to do
> field by field.
>
> b) Some of the members like uur_payload and uur_tuple are not the same
> type in the UnpackUndoRecord compared to how it is stored in the page.
> In UnpackUndoRecord those are StringInfoData whereas on the page we
> store it as UndoRecordPayload header followed by the actual data.  I
> am not saying we can not unpack this directly we can do it like,
> first read the payload length from the page in uur_payload.len then
> read tuple length in uur_tuple.len then read both the data.  And, for
> that, we will have to add extra stages.

I don't think that this is true; or at least, I think it's avoidable.
The idea of an unpacking stage is that you refuse to advance to the
next stage until you've got a certain number of bytes of data; and
then you unpack everything that pertains to that stage.  If you have 2
4-byte fields that you want to unpack together, you can just wait
until you've got 8 bytes of data and then unpack both.  You don't really
need 2 separate stages.  (Similarly, your concern about fields being
in a different order seems like it should be resolved by agreeing on
one ordering and having everything use it; I don't know why there
should be one order that is better in memory and another order that is
better on disk.)

The bigger issue here is that we don't seem to be making very much
progress toward improving the overall design.  Several significant
improvements have been suggested:

1. Thomas suggested quite some time ago that we should make sure that
the transaction header is the first optional header.  If we do that,
then I'm not clear on why we even need this incremental unpacking
stuff any more. The only reason for all of this machinery was so that
we could find the transaction header at an unknown offset inside a
complex record format; there is, if I understand correctly, no other
case in which we want to incrementally decode a record. But if the
transaction header is at a fixed offset, then there seems to be no
need to even have incremental decoding at all.  Or it can be very
simple, with three stages: (1) we don't yet have enough bytes to
figure out how big the record is; (2) we have enough bytes to figure
out how big the record is and we have figured that out but we don't
yet have all of those bytes; and (3) we have the whole record, we can
decode the whole thing and we're done.

2. Based on a discussion with Thomas, I suggested the GHOB stuff,
which gets rid of the idea of transaction headers inside individual
records altogether; instead, there is one undo blob per transaction
(or maybe several if we overflow to another undo log) which begins
with a sentinel byte that identifies it as owned by a transaction, and
then the transaction header immediately follows that without being
part of any record, and the records follow that data.  As with the
previous idea, this gets rid of the need for incremental decoding
because it gets rid of the need to find the transaction header inside
of a bigger record. As Thomas put it to me off-list, it puts the
records inside of a larger chunk of space owned by the transaction
instead of putting the transaction header inside of some particular
record; that seems more correct than the way we have it now.

3. Heikki suggested that the format should be much more generic and
that more should be left up to the AM.  While neither Andres nor I are
convinced that we should go as far in that direction as Heikki is
proposing, the idea clearly has some merit, and we should probably be
moving that way to some degree. For instance, the idea that we should
store a block number and TID is a bit sketchy when you consider that a
multi-insert operation really wants to store a TID list. The zheap
tree has a ridiculous kludge to work around that problem; clearly we
need something better.  We've also mentioned that, in the future, we
might want to support TIDs that are not 6 bytes, and that even just
looking at what's currently under development, zedstore wants to treat
TIDs as 48-bit opaque quantities, not a 4-byte block number and a
2-byte item pointer offset.  So, there is clearly a need to go through
the whole way we're doing this and rethink which parts are generic and
which parts are AM-specific.

4. A related problem, which has been mentioned or at least alluded to
by both Heikki and by me, is that we need a better way of 

Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions

2019-08-21 Thread Stephen Frost
Greetings,

* Tom Lane (t...@sss.pgh.pa.us) wrote:
> +
> + External tools may also
> + modify postgresql.auto.conf.  It is not
> + recommended to do this while the server is running, since a
> + concurrent ALTER SYSTEM command could overwrite
> + such changes.  Such tools might simply append new settings to the end,
> + or they might choose to remove duplicate settings and/or comments
> + (as ALTER SYSTEM will).
>  

While I don't know that we necessarily have to change this language, I
did want to point out for folks who look at these things and consider
the challenges of this change that simply appending, when it comes to
things like backup tools and such, is just not going to work, since
you'll run into things like this:

FATAL:  multiple recovery targets specified
DETAIL:  At most one of recovery_target, recovery_target_lsn, 
recovery_target_name, recovery_target_time, recovery_target_xid may be set.

That's from simply doing a backup, restore with one recovery target,
then back that up and restore with a different recovery target.

Further there's the issue that if you specify a recovery target for the
first restore and then *don't* have one for the second restore, then
you'll still end up trying to restore to the first point...  So,
basically, appending just isn't actually practical for what is probably
the most common use-case these days for an external tool to go modify
postgresql.auto.conf.

And so, every backup tool author that lets a user specify a target
during the restore to generate the postgresql.auto.conf with (formerly
recovery.conf) is going to have to write enough of a parser for PG
config files to be able to find and comment or remove any recovery
target options from postgresql.auto.conf.

That'd be the kind of thing that I was really hoping we could provide a
common library for.
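
As a rough illustration of how small, and how fragile, that kind of parser
ends up being, here is a toy sketch that just comments out every
recovery_target* line (it ignores quoting, odd whitespace and include files,
which a real tool cannot):

    #include <stdio.h>
    #include <string.h>

    /*
     * Toy sketch: read postgresql.auto.conf on stdin and write it to stdout
     * with every recovery_target* setting commented out.
     */
    int
    main(void)
    {
        char		line[1024];

        while (fgets(line, sizeof(line), stdin))
        {
            const char *p = line;

            while (*p == ' ' || *p == '\t')
                p++;
            if (strncmp(p, "recovery_target", strlen("recovery_target")) == 0)
                fputs("#", stdout);
            fputs(line, stdout);
        }
        return 0;
    }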

Thanks,

Stephen




Why overhead of SPI is so large?

2019-08-21 Thread Konstantin Knizhnik

Hi, hackers.

One of our customers complains about slow execution of PL/pgSQL
functions compared with Oracle.
So he wants to compile PL/pgSQL functions (most likely just-in-time
compilation).
Certainly the interpreter adds quite a large overhead compared with native
code (~10 times), but most PL/pgSQL functions just run some SQL queries
and iterate through the results.


I cannot believe that JIT can significantly speed up such functions.
So I decided to run a simple experiment: I created a large enough table
and implemented functions which calculate the norm of one column in
different languages.

The results are frustrating (at least for me):

PL/pgSQL:   29044.361 ms
C/SPI:      22785.597 ms
C/coreAPI:   2873.072 ms
PL/Lua:     33445.520 ms
SQL:         7397.639 ms (with parallel execution disabled)

I expected that the difference between PL/pgSQL and a function implemented
in C using SPI would not be so large.

But why is it more than 3 times slower than the corresponding SQL query?
The answer seems to be in the third result: the same function implemented
in C without SPI (using table_beginscan/heap_getnext).

It looks like SPI adds quite significant overhead, and since almost all PL
languages use SPI, they all suffer from it.


Below is a profile of the SPI function execution:

  9.47%  postgres  libc-2.23.so   [.] __memcpy_avx_unaligned
   9.19%  postgres  spitest.so [.] spi_norm
   8.09%  postgres  postgres   [.] AllocSetAlloc
   4.50%  postgres  postgres   [.] tts_buffer_heap_getsomeattrs
   4.36%  postgres  postgres   [.] heap_form_tuple
   3.41%  postgres  postgres   [.] standard_ExecutorRun
   3.35%  postgres  postgres   [.] ExecScan
   3.31%  postgres  postgres   [.] palloc0
   2.41%  postgres  postgres   [.] heapgettup_pagemode
   2.40%  postgres  postgres   [.] AllocSetReset
   2.25%  postgres  postgres   [.] PopActiveSnapshot
   2.17%  postgres  postgres   [.] PortalRunFetch
   2.16%  postgres  postgres   [.] HeapTupleSatisfiesVisibility
   1.97%  postgres  libc-2.23.so   [.] __sigsetjmp
   1.90%  postgres  postgres   [.] _SPI_cursor_operation
   1.87%  postgres  postgres   [.] AllocSetFree
   1.86%  postgres  postgres   [.] PortalRunSelect
   1.79%  postgres  postgres   [.] heap_getnextslot
   1.75%  postgres  postgres   [.] heap_fill_tuple
   1.70%  postgres  postgres   [.] spi_dest_startup
   1.50%  postgres  postgres   [.] spi_printtup
   1.49%  postgres  postgres   [.] nocachegetattr
   1.45%  postgres  postgres   [.] MemoryContextDelete
   1.44%  postgres  postgres   [.] ExecJustAssignScanVar
   1.38%  postgres  postgres   [.] CreateTupleDescCopy
   1.33%  postgres  postgres   [.] SPI_getbinval
   1.30%  postgres  postgres   [.] PushActiveSnapshot
   1.30%  postgres  postgres   [.] AllocSetContextCreateInternal
   1.22%  postgres  postgres   [.] heap_compute_data_size
   1.22%  postgres  postgres   [.] MemoryContextCreate
   1.14%  postgres  postgres   [.] heapgetpage
   1.05%  postgres  postgres   [.] palloc
   1.03%  postgres  postgres   [.] SeqNext

As you can see, most of the time is spent in memory allocation and copying.
I wonder if somebody has tried to address this problem, and whether there
are plans for improving the speed of PL/pgSQL and other stored languages?

I have attached to this mail the sources of the spi_test extension with my
experiments. Please build it and run the norm.sql file.
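
For readers without the attachment, a minimal sketch of what the C/SPI
variant can look like (it assumes a table t(x float8) and uses a plain
SPI_execute() rather than the cursor loop visible in the profile; the
attached spi_test code may differ):

    #include "postgres.h"
    #include "fmgr.h"
    #include "executor/spi.h"

    #include <math.h>

    PG_MODULE_MAGIC;

    PG_FUNCTION_INFO_V1(spi_norm_sketch);

    Datum
    spi_norm_sketch(PG_FUNCTION_ARGS)
    {
        double		sum = 0.0;

        SPI_connect();
        SPI_execute("SELECT x FROM t", true, 0);

        /* every row has already been copied into the SPI tuple table here */
        for (uint64 i = 0; i < SPI_processed; i++)
        {
            bool		isnull;
            Datum		d = SPI_getbinval(SPI_tuptable->vals[i],
                                          SPI_tuptable->tupdesc, 1, &isnull);

            if (!isnull)
            {
                double		v = DatumGetFloat8(d);

                sum += v * v;
            }
        }
        SPI_finish();
        PG_RETURN_FLOAT8(sqrt(sum));
    }

That extra copy of every row into the SPI tuple table (visible as
spi_printtup/heap_form_tuple in the profile above) is one obvious place
where the allocation and copying add up.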

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



spi_test.tgz
Description: application/compressed-tar


Re: [proposal] de-TOAST'ing using a iterator

2019-08-21 Thread Binguo Bao
John Naylor wrote on Monday, August 19, 2019 at 12:55 PM:

> init_toast_buffer():
>
> + * Note the constrain buf->position <= buf->limit may be broken
> + * at initialization. Make sure that the constrain is satisfied
> + * when consume chars.
>
> s/constrain/constraint/ (2 times)
> s/consume/consuming/
>
> Also, this comment might be better at the top the whole function?
>

The constraint is broken in the if branch, so I think putting this comment
in that branch is more precise.

The check
> if (iter->buf != iter->fetch_datum_iterator->buf)
> is what we need to do for the compressed case. Could we use this
> directly instead of having a separate state variable iter->compressed,
> with a macro like this?
> #define TOAST_ITER_COMPRESSED(iter) \
> (iter->buf != iter->fetch_datum_iterator->buf)


 The logic of the macro may be hard to understand, so I think it's ok to
just check the compressed state variable.

+ * If "ctrlc" field in iterator is equal to INVALID_CTRLC, it means that
> + * the field is invalid and need to read the control byte from the
> + * source buffer in the next iteration, see pglz_decompress_iterate().
> + */
> +#define INVALID_CTRLC 8
>
> I think the macro might be better placed in pg_lzcompress.h, and for
> consistency used in pglz_decompress(). Then the comment can be shorter
> and more general. With my additional comment in
> init_detoast_iterator(), hopefully it will be clear to readers.
>

The main role of this macro is to explain the iterator's "ctrlc" state, so
IMO it's reasonable to put the macro and the definition of the de-TOAST
iterator together.

Thanks for your suggestion, I have updated the patch.
-- 
Best regards,
Binguo Bao
From d8d14600e2be0c11f233bf3302a823a8c4f19c2a Mon Sep 17 00:00:00 2001
From: BBG 
Date: Tue, 4 Jun 2019 22:56:42 +0800
Subject: [PATCH] de-TOASTing using a iterator

---
 src/backend/access/heap/tuptoaster.c | 472 +++
 src/backend/utils/adt/varlena.c  |  42 +++-
 src/include/access/tuptoaster.h  |  97 +++
 src/include/fmgr.h   |   7 +
 4 files changed, 612 insertions(+), 6 deletions(-)

diff --git a/src/backend/access/heap/tuptoaster.c b/src/backend/access/heap/tuptoaster.c
index 74233bb..ba05fd1 100644
--- a/src/backend/access/heap/tuptoaster.c
+++ b/src/backend/access/heap/tuptoaster.c
@@ -83,6 +83,13 @@ static int	toast_open_indexes(Relation toastrel,
 static void toast_close_indexes(Relation *toastidxs, int num_indexes,
 LOCKMODE lock);
 static void init_toast_snapshot(Snapshot toast_snapshot);
+static FetchDatumIterator create_fetch_datum_iterator(struct varlena *attr);
+static void free_fetch_datum_iterator(FetchDatumIterator iter);
+static void fetch_datum_iterate(FetchDatumIterator iter);
+static void init_toast_buffer(ToastBuffer *buf, int size, bool compressed);
+static void free_toast_buffer(ToastBuffer *buf);
+static void pglz_decompress_iterate(ToastBuffer *source, ToastBuffer *dest,
+	DetoastIterator iter);
 
 
 /* --
@@ -347,6 +354,119 @@ heap_tuple_untoast_attr_slice(struct varlena *attr,
 
 
 /* --
+ * init_detoast_iterator -
+ *
+ * It only makes sense to initialize a de-TOAST iterator for external on-disk values.
+ *
+ * --
+ */
+bool init_detoast_iterator(struct varlena *attr, DetoastIterator iter)
+{
+	struct varatt_external toast_pointer;
+	if (VARATT_IS_EXTERNAL_ONDISK(attr))
+	{
+		iter->done = false;
+
+		/*
+		 * This is an externally stored datum --- initialize fetch datum iterator
+		 */
+		iter->fetch_datum_iterator = create_fetch_datum_iterator(attr);
+		VARATT_EXTERNAL_GET_POINTER(toast_pointer, attr);
+		if (VARATT_EXTERNAL_IS_COMPRESSED(toast_pointer))
+		{
+			iter->compressed = true;
+
+			/* prepare buffer to received decompressed data */
+			iter->buf = (ToastBuffer *) palloc0(sizeof(ToastBuffer));
+			init_toast_buffer(iter->buf, toast_pointer.va_rawsize, false);
+
+			/* initialize state for pglz_decompress_iterate() */
+			iter->ctrl = 0;
+			iter->ctrlc = INVALID_CTRLC;
+		}
+		else
+		{
+			iter->compressed = false;
+
+			/* point the buffer directly at the raw data */
+			iter->buf = iter->fetch_datum_iterator->buf;
+		}
+		return true;
+	}
+	else if (VARATT_IS_EXTERNAL_INDIRECT(attr))
+	{
+		 /* indirect pointer --- dereference it */
+		struct varatt_indirect redirect;
+
+		VARATT_EXTERNAL_GET_POINTER(redirect, attr);
+		attr = (struct varlena *) redirect.pointer;
+
+		/* nested indirect Datums aren't allowed */
+		Assert(!VARATT_IS_EXTERNAL_INDIRECT(attr));
+
+		/* recurse in case value is still extended in some other way */
+		return init_detoast_iterator(attr, iter);
+
+	}
+	else
+	{
+		/* in-line value -- no iteration used, even if it's compressed */
+		return false;
+	}
+}
+
+
+/* --
+ * free_detoast_iterator -
+ *
+ * Free memory used by the de-TOAST iterator, including buffers and
+ * fetch datum iterator.
+ *
+ * Note: "iter" variable is normally just a local variable in the caller, so
+ * shouldn't free de-TOAST iterator its

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

2019-08-21 Thread Anastasia Lubennikova

20.08.2019 4:04, Peter Geoghegan wrote:

On Fri, Aug 16, 2019 at 8:56 AM Anastasia Lubennikova
 wrote:


It seems that now all replace operations are crash-safe. The new patch passes
all regression tests, so I think it's ready for review again.

I'm looking at it now. I'm going to spend a significant amount of time
on this tomorrow.

I think that we should start to think about efficient WAL-logging now.


Thank you for the review.

The new version v8 is attached. Compared to the previous version, this patch
includes updated btree_xlog_insert() and btree_xlog_split(), so that WAL records
now only contain data about the updated posting tuple and don't require full-page
writes.
I haven't updated pg_waldump yet. That is postponed until we agree on the
nbtxlog changes.

Also in this patch I renamed all 'compress' keywords to 'deduplicate'
and did minor cleanup of outdated comments.

I'm going to look through the patch once more to update the nbtxlog
comments where needed, and answer your remarks that are still left in the
comments.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

commit d73c1b8e10177dfb55ff1b1bac999f85d2a0298d
Author: Anastasia 
Date:   Wed Aug 21 20:00:54 2019 +0300

v8-0001-Deduplication-in-nbtree.patch

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 05e7d67..ddc511a 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -924,6 +924,7 @@ bt_target_page_check(BtreeCheckState *state)
 		size_t		tupsize;
 		BTScanInsert skey;
 		bool		lowersizelimit;
+		ItemPointer	scantid;
 
 		CHECK_FOR_INTERRUPTS();
 
@@ -994,29 +995,73 @@ bt_target_page_check(BtreeCheckState *state)
 
 		/*
 		 * Readonly callers may optionally verify that non-pivot tuples can
-		 * each be found by an independent search that starts from the root
+		 * each be found by an independent search that starts from the root.
+		 * Note that we deliberately don't do individual searches for each
+		 * "logical" posting list tuple, since the posting list itself is
+		 * validated by other checks.
 		 */
 		if (state->rootdescend && P_ISLEAF(topaque) &&
 			!bt_rootdescend(state, itup))
 		{
 			char	   *itid,
 	   *htid;
+			ItemPointer tid = BTreeTupleGetHeapTID(itup);
 
 			itid = psprintf("(%u,%u)", state->targetblock, offset);
 			htid = psprintf("(%u,%u)",
-			ItemPointerGetBlockNumber(&(itup->t_tid)),
-			ItemPointerGetOffsetNumber(&(itup->t_tid)));
+			ItemPointerGetBlockNumber(tid),
+			ItemPointerGetOffsetNumber(tid));
 
 			ereport(ERROR,
 	(errcode(ERRCODE_INDEX_CORRUPTED),
 	 errmsg("could not find tuple using search from root page in index \"%s\"",
 			RelationGetRelationName(state->rel)),
-	 errdetail_internal("Index tid=%s points to heap tid=%s page lsn=%X/%X.",
+	 errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
 		itid, htid,
 		(uint32) (state->targetlsn >> 32),
 		(uint32) state->targetlsn)));
 		}
 
+		/*
+		 * If tuple is actually a posting list, make sure posting list TIDs
+		 * are in order.
+		 */
+		if (BTreeTupleIsPosting(itup))
+		{
+			ItemPointerData last;
+			ItemPointer		current;
+
+			ItemPointerCopy(BTreeTupleGetHeapTID(itup), &last);
+
+			for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+			{
+
+current = BTreeTupleGetPostingN(itup, i);
+
+if (ItemPointerCompare(current, &last) <= 0)
+{
+	char	   *itid,
+			   *htid;
+
+	itid = psprintf("(%u,%u)", state->targetblock, offset);
+	htid = psprintf("(%u,%u)",
+	ItemPointerGetBlockNumberNoCheck(current),
+	ItemPointerGetOffsetNumberNoCheck(current));
+
+	ereport(ERROR,
+			(errcode(ERRCODE_INDEX_CORRUPTED),
+			 errmsg("posting list heap TIDs out of order in index \"%s\"",
+	RelationGetRelationName(state->rel)),
+			 errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
+itid, htid,
+(uint32) (state->targetlsn >> 32),
+(uint32) state->targetlsn)));
+}
+
+ItemPointerCopy(current, &last);
+			}
+		}
+
 		/* Build insertion scankey for current page offset */
 		skey = bt_mkscankey_pivotsearch(state->rel, itup);
 
@@ -1074,12 +1119,33 @@ bt_target_page_check(BtreeCheckState *state)
 		{
 			IndexTuple	norm;
 
-			norm = bt_normalize_tuple(state, itup);
-			bloom_add_element(state->filter, (unsigned char *) norm,
-			  IndexTupleSize(norm));
-			/* Be tidy */
-			if (norm != itup)
-pfree(norm);
+			if (BTreeTupleIsPosting(itup))
+			{
+IndexTuple	onetup;
+
+/* Fingerprint all elements of posting tuple one by one */
+for (int i = 0; i < BTreeTupleGetNPosting(itup); i++)
+{
+	onetup = BTreeGetNthTupleOfPosting(itup, i);
+
+	norm = bt_normalize_tuple(state, onetup);
+	bloom_add_element(state->filter, (unsigned char *) norm,
+	  IndexTupleSize(norm));
+	/* Be tidy */
+	if (norm != onet

Re: POC: Cleaning up orphaned files using undo logs

2019-08-21 Thread Andres Freund



On August 21, 2019 8:36:34 AM PDT, Robert Haas  wrote:
> We treat LWLockAcquire() as a no-fail operation in many
>places; in my opinion, that elog(ERROR) that we have for too many
>LWLocks should be changed to elog(PANIC) precisely because we do treat
>LWLockAcquire() as no-fail in lots of places in the code, but I think
>I suggested that once and got slapped down, and I haven't had the
>energy to fight about it.

Fwiw, that proposal has my vote.
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.




Re: Cleanup isolation specs from unused steps

2019-08-21 Thread Melanie Plageman
On Mon, Aug 19, 2019 at 7:01 PM Michael Paquier  wrote:

>
> It is rather a pain to pass down custom options to isolationtester.
> For example, I have tested the updated version attached after
> hijacking -n into isolation_start_test().  Ugly hack, but for testing
> that's enough.  Do you make use of this tool in a particular way in
> greenplum?  Just wondering.
>
> (Could it make sense to have long options for isolationtester by the
> way?)
>

In Greenplum, we mainly add new tests to a separate isolation
framework (called isolation2) which uses a completely different
syntax. It doesn't use isolationtester at all. So, I haven't had a use
case to add long options to isolationtester yet :)

-- 
Melanie Plageman


Re: Cleanup isolation specs from unused steps

2019-08-21 Thread Melanie Plageman
On Tue, Aug 20, 2019 at 6:34 PM Michael Paquier  wrote:

> On Tue, Aug 20, 2019 at 09:54:56AM -0400, Alvaro Herrera wrote:
> > On 2019-Aug-20, Tom Lane wrote:
> >> If you can warn in both cases, that'd be OK perhaps.  But Alvaro's
> >> description of the intended use of dry-run makes it sound like
> >> it would be expected for there to be unreferenced steps, since there'd
> >> be no permutations yet in the input.
>
> v2 of the patch warns of any unused steps in dry-run mode.  If no
> permutations are defined in the input spec, all steps get used
> (actually that's not a factorial as the per-session ordering is
> preserved), so I would expect no warnings to be generated and the
> patch does that.  If the test includes some permutations, then I would
> expect dry-run to complain about the steps which are defined in the
> spec but not used.  The patch also does that.  Do you see a problem
> with that?
>
> > Well, Heikki/Kevin's original intention was that if no permutations are
> > specified, then all possible permutations are generated internally and
> > the spec is run with that.  The intended use of dry-run was to do just
> > that (generate all possible permutations) and print that list, so that
> > it could be trimmed down by the test author.  In this mode of operation,
> > all steps are always used, so there'd be no warning printed.  However,
> > when a test file has a largish number of steps then the list is, um, a
> > bit long.  Before the deadlock-test hacking, you could run with such a
> > list anyway and any permutations that caused a blockage would be
> > reported right away as an invalid permutation -- quick enough.
> > Currently it sleeps for absurdly long on those cases, so this is no
> > longer feasible.
> >
> > This is why I say that the current dry-run mode could be removed with no
> > loss of useful functionality.
>
> Hmm.  Even if one does not do something deadlock-specific, the list
> printed could still be useful, no?  This for example works now that I
> look at it:
> ./isolationtester -n < specs/my_spec.spec
>
> I am wondering if we should not actually keep dry_run, but rename it
> to something like --print-permutations to print the set of
> permutations which would be run as part of the spec, and also have an
> option which is able to print out all permutations possible, like
> --print-all-permutations.  Simply ripping out the mode would be fine
> by me as it does not seem to be used, but keeping it around does not
> induce really much extra maintenance cost.
>

So, I think I completely misunderstood the purpose of 'dry-run'. If no
one is using it, having a check for unused steps in dry-run may not be
useful.

+1 to renaming it to --print-permutations and, potentially,
adding --print-all-permutations

-- 
Melanie Plageman


Re: Cleanup isolation specs from unused steps

2019-08-21 Thread Alvaro Herrera
On 2019-Aug-21, Melanie Plageman wrote:

> In Greenplum, we mainly add new tests to a separate isolation
> framework (called isolation2) which uses a completely different
> syntax. It doesn't use isolationtester at all. So, I haven't had a use
> case to add long options to isolationtester yet :)

Is that other framework somehow more capable?

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Removing pg_pltemplate and creating "trustable" extensions

2019-08-21 Thread Tom Lane
We've repeatedly kicked around the idea of getting rid of the
pg_pltemplate catalog in favor of keeping that information directly in
the languages' extension files [1][2][3][4].  The primary abstract
argument for that is that it removes a way in which our in-tree PLs
are special compared to out-of-tree PLs, which can't have entries in
pg_pltemplate.  A concrete argument for it is that it might simplify
fixing the python-2-vs-python-3 mess, since one of the issues there
is that pg_pltemplate has hard-wired knowledge that "plpythonu" is
Python 2.  Accordingly, attached is a patch series that ends by
removing that catalog.

As I noted in [2], the main stumbling block to doing this is that
the code associated with pg_pltemplate provides a privilege override
mechanism that allows non-superuser database owners to install trusted
PLs.  For backwards compatibility if nothing else, we probably want to
keep that ability, though it'd be nice if it weren't such a hard-wired
behavior.

Patch 0001 below addresses this problem by inventing a concept of
"trustable" (not necessarily trusted) extensions.  An extension that
would normally require superuser permissions (e.g., because it creates
C functions) can now be installed by a non-superuser if (a) it is
marked trustable in the extension's control file, AND (b) it is
listed as trusted in one of two new GUCs, trusted_extensions_dba and
trusted_extensions_anyone.  (These names could stand a visit to the
bikeshed, no doubt.)  Extensions matching trusted_extensions_dba can
be installed by a database owner, while extensions matching
trusted_extensions_anyone can be installed by anybody.  The default
settings of these GUCs provide backwards-compatible behavior, but
they can be adjusted to provide more or less ability to install
extensions.  (This design is basically what Andres advocated in [2].)

In this patch series, I've only marked the trusted-PL extensions as
trustable, but we should probably make most of the contrib extensions
trustable --- not, say, adminpack, but surely most of the datatype
and transform modules could be marked trustable.  (Maybe we could
make the default GUC settings more permissive, too.)

As coded, the two GUCs are not lists of extension names but rather
regexes.  You could use them as lists, eg "^plperl$|^plpgsql$|^pltcl$"
but that's a bit tedious, especially if someone wants to trust most
or all of contrib.  I am a tad worried about user-friendliness of
this notation, but I think we need something with wild-cards, and
that's the only wild-card-capable matching engine we have available
at a low level.

You might wonder why bother with the trustable flag rather than just
relying on the GUCs.  The answer is mostly paranoia: I'm worried about
somebody writing e.g. "plperl" with no anchors and not realizing that
that will match "plperlu" as well.  Anyway, since we're talking about
potential escalation-to-superuser security problems, I think having
both belt and suspenders protection on untrusted languages is wise.

There are no regression tests for this functionality in 0001,
but I added one in 0002.

Patch 0002 converts all the in-tree PLs to use fully specified
CREATE LANGUAGE and not rely on pg_pltemplate.

I had a better idea about how to manage permissions than what was
discussed in [3]; we can just give ownership of the language
object to the user calling CREATE EXTENSION.  Doing it that way
means that we end up with exactly the same catalog state as we
do in existing releases.  And that should mean that we don't have
to treat this as an extension version upgrade.  So I just modified
the 1.0 scripts in-place instead of adding 1.0--1.1 scripts.  It
looks to me like there's no need to touch the from-unpackaged
scripts, either.  And by the same token this isn't really an issue
for pg_upgrade.

(I noticed while testing this that pg_upgrade fails to preserve
ownership on extensions, but that's not new; this patch is not
making that situation any better or worse than it was.  Still,
maybe we oughta try to fix that sometime soon too.)

Patch 0003 removes CREATE LANGUAGE's reliance on pg_pltemplate.
CREATE LANGUAGE without parameters is now interpreted as
CREATE EXTENSION, thus providing a forward compatibility path
for old dump files.

Note: this won't help for *really* old dump files, ie those containing
CREATE LANGUAGE commands that do have parameters but the parameters are
wrong according to modern usage.  This is a hazard for dumps coming
from 8.0 or older servers; we invented pg_pltemplate in 8.1 primarily
as a way of cleaning up such dumps [5].  I think that that's far enough
back that we don't have to worry about how convenient it will be to go
from 8.0-or-older to v13-or-newer in one jump.

Finally, patch 0004 removes the now-unused catalog and cleans up some
incidental comments referring to it.

Once this is in, we could start thinking about whether we actually
want to change anything about plpython in the near future.


Re: "ago" times on buildfarm status page

2019-08-21 Thread Andrew Dunstan


On 8/21/19 11:07 AM, Andrew Dunstan wrote:
> On 8/21/19 9:55 AM, Tom Lane wrote:
>> The real problem with that column though is that it relies on run start
>> times that are self-reported by the buildfarm clients, and some of them
>> have system clocks that are many hours off reality.  What *I'd* like to
>> see is for the column to contain time of receipt of the buildfarm report
>> at the server, less the measured runtime of the test.
>
> That might be possible. I'll put it on my list of things to do. It's not
> happening any time soon, though.
>
>


What I have done quickly is to store a measure of the clock skew. We
already calculated it but we didn't store it. Initial indications are
that only a few have significant skew.


Still, if we simply added the skew to the snapshot time that might be
enough to achieve what you want. That would be a one line change, I think.


cheers


andrew


-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: pg_upgrade fails with non-standard ACL

2019-08-21 Thread Bruce Momjian
On Tue, Aug 20, 2019 at 04:38:18PM +0300, Anastasia Lubennikova wrote:
> > Solving this in pg_upgrade does seem like it's probably the better
> > approach rather than trying to do it in pg_dump.  Unfortunately, that
> > likely means that all we can do is have pg_upgrade point out to the user
> > when something will fail, with recommendations on how to address it, but
> > that's also something users are likely used to and willing to accept,
> > and puts the onus on them to consider their ACL decisions when we're
> > making catalog changes, and it keeps these issues out of pg_dump.
> 
> 
> I wrote a prototype to check API and ACL compatibility (see attachment).
> In the current implementation it fetches the list of system procedures for
> both old and new clusters
> and reports deleted functions or ACL changes during pg_upgrade check:
> 
> 
> Performing Consistency Checks
> -
> ...
> Checking for system functions API compatibility
> dbname postgres : check procsig is equal pg_stop_backup(), procacl not equal
> {anastasia=X/anastasia,backup=X/anastasia} vs {anastasia=X/anastasia}
> dbname postgres : procedure pg_stop_backup(exclusive boolean, OUT lsn
> pg_lsn, OUT labelfile text, OUT spcmapfile text) doesn't exist in
> new_cluster
> dbname postgres : procedure pg_switch_xlog() doesn't exist in new_cluster
> dbname postgres : procedure pg_xlog_replay_pause() doesn't exist in
> new_cluster
> dbname postgres : procedure pg_xlog_replay_resume() doesn't exist in
> new_cluster
> ...
> 
> I think it's a good first step in the right direction.
> Now I have questions about implementation details:
> 
> 1) How exactly should we report this incompatibility to a user?
> I think it's fine to leave the warnings and also write some hint for the
> user by analogy with other checks.
> "Reset ACL on the problem functions to default in the old cluster to
> continue"

Yes, I think it is good to at least throw an error during --check so
they don't have to find out during a live upgrade.  Odds are it will
require manual repair.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: Change ereport level for QueuePartitionConstraintValidation

2019-08-21 Thread Sergei Kornilov
Hello

I noticed appveyor build on windows is not happy:

> perl buildsetup.pl
> Could not determine contrib module type for alter_table
>  at buildsetup.pl line 38.

But I have no idea why. I can't check on Windows. Possibly I missed some change
while adding the new module to the tree. Will check. Please let me know if the
root of such an error is known.

https://ci.appveyor.com/project/postgresql-cfbot/postgresql/builds/26831382
https://ci.appveyor.com/project/postgresql-cfbot/postgresql/build/1.0.53294

regards, Sergei




Re: configure still looking for crypt()?

2019-08-21 Thread Peter Eisentraut
On 2019-08-20 16:07, Tom Lane wrote:
> Peter Eisentraut  writes:
>> I noticed that configure is still looking for crypt() and crypt.h.
>> Isn't that long obsolete?
>> If so, I suggest to remove it with the attached patch.
> 
> +1

done

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: "ago" times on buildfarm status page

2019-08-21 Thread Tom Lane
Andrew Dunstan  writes:
> What I have done quickly is to store a measure of the clock skew. We
> already calculated it but we didn't store it. Initial indications are
> that only a few have significant skew.

Oh, I didn't know that the server had the ability to measure that.

(Yes, I agree that there are just a couple with big skews at any
one time.)

> Still, if we simply added the skew to the snapshot time that might be
> enough to achieve what you want. That would be a one line change, I think.

+1

regards, tom lane




Re: Change ereport level for QueuePartitionConstraintValidation

2019-08-21 Thread Tom Lane
Sergei Kornilov  writes:
> I noticed appveyor build on windows is not happy:
>> perl buildsetup.pl
>> Could not determine contrib module type for alter_table
>> at buildsetup.pl line 38.

> But I have no idea why. I can't check on windows. Possible I miss some change 
> while adding new module to tree. Will check. Please let me know if root of 
> such error is known.

> https://ci.appveyor.com/project/postgresql-cfbot/postgresql/builds/26831382
> https://ci.appveyor.com/project/postgresql-cfbot/postgresql/build/1.0.53294

grep is your friend: that message is coming out of Mkvcbuild.pm's
AddContrib.  (No idea why perl is fingering someplace else.)  Apparently
you need to have one of MODULE_big, MODULES, or PROGRAM defined, unless
you add the module to @contrib_excludes to keep it from being built.

regards, tom lane




Re: POC: Cleaning up orphaned files using undo logs

2019-08-21 Thread Robert Haas
On Wed, Aug 14, 2019 at 2:57 AM Andres Freund  wrote:
> - My reading of the current xact.c integration is that it's not workable
>   as is. Undo is executed outside of a valid transaction state,
>   exceptions aren't properly undone, logic would need to be duplicated
>   to a significant degree, new kind of critical section.

Regarding this particular point:

ReleaseResourcesAndProcessUndo() is only supposed to be called after
AbortTransaction(), and the additional steps it performs --
AtCleanup_Portals() and AtEOXact_Snapshot() or alternatively
AtSubCleanup_Portals -- are taken from Cleanup(Sub)Transaction.
That's not crazy; the other steps in Cleanup(Sub)Transaction() look
like stuff that's intended to be performed when we're totally done
with this TransactionState stack entry, whereas these things are
slightly higher-level cleanups that might even block undo (e.g.
undropped portal prevents orphaned file cleanup).  Granted, there are
no comments explaining why those particular cleanup steps are
performed here, and it's possible some other approach is better, but I
think perhaps it's not quite as flagrantly broken as you think.

I am also not convinced that semi-critical sections are a bad idea,
although the if (SemiCritSectionCount > 0) test at the start of
ReleaseResourcesAndProcessUndo() looks wrong.  To roll back a
subtransaction, we must perform undo in the foreground, and if that
fails, the toplevel transaction can't be allowed to commit, full stop.
Since we expect this to be a (very) rare scenario, I don't know why
escalating to FATAL is a catastrophe.  The only other option is to do
something along the lines of SxactIsDoomed(), where we force all
subsequent commits (and sub-commits?) within the toplevel xact to
fail. You can argue that the latter is a better user experience, and
for SSI I certainly agree, but this case isn't quite the same: there's
a good chance we're dealing with a corrupted page or system
administrator intervention to try to kill off a long-running undo
task, and continuing in such cases seems a lot more dubious than after
a serializability failure, where retrying is the expected recovery
mode. The other case is where toplevel undo for a temporary table
fails.  It is unclear to me what, other than FATAL, could suffice
there.  I guess you could just let the session continue and leave the
transaction undone, leaving whatever MVCC machinery the table AM may
have look through it, but that sounds inferior to me. Rip the bandaid
off.

Some general complaints from my side about the xact.c changes:

1. The code structure doesn't seem quite right.  For example:

1a. ProcessUndoRequestForEachLogCat has a try/catch block, but it
seems to me that the job of a try/catch block is to provide structured
error-handling for code for resources for which there's no direct
handling in xact.c or resowner.c.  Here, we're inside of xact.c, so
why are we adding a try/catch block?
1b. ReleaseResourcesAndProcessUndo does part of the work of cleaning
up a failed transaction but not all of it, the rest being done by
AbortTransaction, which is called before entering it, plus it also
kicks off the actual undo work.  I would expect a cleaner division of
responsibility.
1c. Having an undo request per UndoLogCategory rather than one per
transaction doesn't seem right to me; hopefully that will get cleaned
up when the undorequest.c stuff I sent before is integrated.
1d. The code at the end of FinishPreparedTransaction() seems to expect
that the called code will catch any error, but that clearing the error
state might need to happen here, and also that we should fire up a new
transaction; I suspect, but am not entirely sure, that that is not the
way it should work.  The code added earlier in that function also
looks suspicious, because it's filling up what is basically a
high-level control function with a bunch of low-level undo-specific
details.  In both places, the undo-specific concerns probably need to
be better-isolated.

2. Signaling is done using some odd-looking mechanisms.  For instance:

2a. The SemiCritSectionCount > 0 test at the top of
ReleaseResourcesAndProcessUndo that I complained about earlier looks
like a guard against reentrancy, but that must be the wrong way to get
there; it makes it impossible to reuse what is ostensibly a
general-purpose facility for any non-undo related purpose without
maybe breaking something.
2b. ResetUndoActionsInfo() is called from a bunch of place, but only 2
of those places have a comment explaining why, and the function
comment is pretty unilluminating. This looks like some kind of
signaling machinery, but it's not very clear to me what it's actually
trying to do.
2c. ResourceOwnerReleaseInternal() is directly calling
NeedToPerformUndoActions(), which feels like a layering violation.
2d. I'm not really sure that TRANS_UNDO is serving any useful purpose;
I think we need TBLOCK_UNDO and TBLOCK_SUBUNDO, but I'm not really
sure TRANS_UNDO is doing anything useful; the cha

Re: Adding a test for speculative insert abort case

2019-08-21 Thread Melanie Plageman
On Wed, Aug 7, 2019 at 1:47 PM Melanie Plageman 
wrote:

>
>
> On Wed, Jul 24, 2019 at 11:48 AM Andres Freund  wrote:
>
>> > diff --git a/src/test/isolation/specs/insert-conflict-specconflict.spec
>> b/src/test/isolation/specs/insert-conflict-specconflict.spec
>> > index 3a70484fc2..7f29fb9d02 100644
>> > --- a/src/test/isolation/specs/insert-conflict-specconflict.spec
>> > +++ b/src/test/isolation/specs/insert-conflict-specconflict.spec
>> > @@ -10,7 +10,7 @@ setup
>> >  {
>> >   CREATE OR REPLACE FUNCTION blurt_and_lock(text) RETURNS text
>> IMMUTABLE LANGUAGE plpgsql AS $$
>> >   BEGIN
>> > -RAISE NOTICE 'called for %', $1;
>> > +RAISE NOTICE 'blurt_and_lock() called for %', $1;
>> >
>> >   -- depending on lock state, wait for lock 2 or 3
>> >  IF
>> pg_try_advisory_xact_lock(current_setting('spec.session')::int, 1) THEN
>> > @@ -23,9 +23,16 @@ setup
>> >  RETURN $1;
>> >  END;$$;
>> >
>> > +CREATE OR REPLACE FUNCTION blurt_and_lock2(text) RETURNS text
>> IMMUTABLE LANGUAGE plpgsql AS $$
>> > +BEGIN
>> > +RAISE NOTICE 'blurt_and_lock2() called for %', $1;
>> > +PERFORM
>> pg_advisory_xact_lock(current_setting('spec.session')::int, 4);
>> > +RETURN $1;
>> > +END;$$;
>> > +
>>
>> Any chance for a bit more descriptive naming than *2? I can live with
>> it, but ...
>>
>>
> Taylor Vesely and I paired on updating this test, and, it became clear
> that the way that the steps and functions are named makes it very
> difficult to understand what the test is doing. That is, I helped
> write this test and, after a month away, I could no longer understand
> what it was doing at all.
>
> We changed the text of the blurts to "acquiring advisory lock ..."
> from "blocking" because we realized that this would print even when
> the lock was acquired immediately successfully, which is a little
> misleading for the reader.
>
> He's taking a stab at some renaming/refactoring to make it more clear
> (including renaming blurt_and_lock2())
>

So, Taylor and I had hoped to rename the steps to something more specific that
told the story of what this test is doing and made it more clear. Unfortunately,
our attempt to do that didn't work and made step re-use very difficult.
Alas, we decided the original names were less confusing.

My idea for renaming blurt_and_lock2() was actually to rename blurt_and_lock()
to blurt_and_lock_123() -- since it always takes a lock on 1, 2, or 3.
Then, I could name the second function, which locks 4, blurt_and_lock_4().
What do you think?

I've attached a rebased patch updated with the new function names.


>
>
>>
>> > +step "controller_print_speculative_locks" { SELECT
>> locktype,classid,objid,mode,granted FROM pg_locks WHERE
>> locktype='speculative
>> > +token' ORDER BY granted; }
>>
>> I think showing the speculative locks is possibly going to be unreliable
>> - the release time of speculative locks is IIRC not that reliable. I
>> think it could e.g. happen that speculative locks are held longer
>> because autovacuum spawned an analyze in the background.
>>
>>
> I actually think having the "controller_print_speculative_locks"
> wouldn't be a problem because we have not released the advisory lock
> on 4 in s2 that allows it to complete its speculative insertion and so
> s1 will still be in speculative wait.
>
> The step that might be a problem if autovacuum delays release of the
> speculative locks is the "controller_show" step, because, at that
> point, if the lock wasn't released, then s1 would still be waiting and
> wouldn't have updated.
>

So, what should we do about this? Do you agree that the "controller_show" step
would be a problem?

-- 
Melanie Plageman


v4-0001-Test-additional-speculative-conflict-scenarios.patch
Description: Binary data


XPRS

2019-08-21 Thread Thomas Munro
Hello,

After rereading some old papers recently, I wanted to share some
thoughts about XPRS and modern PostgreSQL.  XPRS stood for "eXtended
Postgres on RAID and Sprite", and was a research project done nearly
three decades ago at Berkeley by the POSTGRES group working with
operating system researchers, on a shared memory machine with a lot of
CPUs for the time (12).

As far as I can tell (and if anyone knows how to find out, I'd love to
know), the parallel query parts of the XPRS system described in
various writings by Wei Hong and Michael Stonebraker are actually
present in the POSTGRES 4.2 tarball and were removed in Postgres95.
Evidence: 4.2's parallel code is all wrapped in #ifdef sequent, and we
know that XPRS ran on a Sequent Symmetry; the parallel hash join
algorithm matches the description given in Hong's doctoral thesis; the
page and range based parallel scans seen in various places also seem
to match.

Hong's thesis covers a lot of material and I certainly haven't
understood all of it, but basically it's about how to share CPU, IO
bandwidth and memory out fairly and dynamically at execution time so
you're using the whole system efficiently.  Facets of this problem
obviously keep coming up on this mailing list (see practically any
discussion of parallel degree, admission control, prefetch or
work_mem, all of which we punt on by deferring to user supplied GUCs
and scan size-based heuristics).

Here are three things I wanted to highlight from Hong's 1991 paper[1]
(later writings elaborate greatly but this one is short and much
easier to read than the thesis and sets the scene):

1.  "The overall performance goal of a multiprocessor database system
is to obtain increased throughput as well as reduced response time in
a multiuser environment.  The objective function that XPRS uses for
query optimization is a combination of resource consumption and
response time as follows:  cost = resource_consumption + w *
response_time, where w is a system-specific weighting factor."

2.  The "Buffer-Size-Independent Hypothesis" (here meaning work_mem):
"The choice of the best sequential plan is insensitive to the amount
of buffer space available as long as the buffer size is above the hash
join threshold" (with a caveat about localised special cases that can
be handled by choosing alternative subplans at runtime).

3.  The "Two-Phase Hypothesis": "The best parallel plan is a
parallelization of the best sequential plan."

I read all of that a while back while working on bits of parallel
query machinery (though I only realised after the fact that I'd
implemented parallel hash the same way as Hong did 27 years earlier,
that is, shared no-partition, which is now apparently back in vogue
due to the recent ubiquity of high core count shared memory systems,
so that every server looks a bit like a Sequent Symmetry; for example
Oracle is rumoured to have a "parallel shared hash join" like ours in
the pipeline).  I didn't understand the context or importance of XPRS,
though, until I read this bit of Hellerstein's "Looking Back at
Postgres"[2]:

"In principle, parallelism “blows up” the plan space for a query
optimizer by making it multiply the traditional choices made during
query optimization (data access, join algorithms, join orders) against
all possible ways of parallelizing each choice. The basic idea of what
Stonebraker called “The Wei Hong Optimizer” was to cut the problem in
two: run a traditional single-node query optimizer in the style of
System R, and then “parallelize” the resulting single-node query plan
by scheduling the degree of parallelism and placement of each operator
based on data layouts and system configuration. This approach is
heuristic, but it makes parallelism an additive cost to traditional
query optimization, rather than a multiplicative cost.

Although “The Wei Hong Optimizer” was designed in the context of
Postgres, it became the standard approach for many of the parallel
query optimizers in industry."

I don't know what to think about the buffer-size-independent
hypothesis, but the two-phase hypothesis and the claim that it is the
standard approach caught my attention.  Firstly, I don't think the
hypothesis holds on our system currently, because (for example) we
lack parallel merge joins and sorts, so you couldn't parallelise such
serial plans, and yet we'd already have thrown away a hash join based
plan that would be vastly better in parallel.  That might be just an
implementation completeness problem.  I wonder what fundamental
problems lurk here.  (Perhaps the non-availability of globally unique
partial paths?)  Anyway, AFAICS we do the exact thing Hong wanted to
avoid: we plan parallel queries as extra paths at planning time.  We
don't really suffer too much of a planning explosion though, because
we don't consider different parallel degrees.  If we did, because our
cost model doesn't include any penalty for resource usage, I suspect
we'd always go for the maximum number of workers becaus

Does TupleQueueReaderNext() really need to copy its result?

2019-08-21 Thread Thomas Munro
Hi,

A comment in tqueue.c says that the bytes returned by shm_mq_receive()
"had better be sufficiently aligned", before assigning the pointer to
htup.t_data.  Then it copies htup and returns the copy (and it did so
in the earlier version that had all the remapping stuff, too, but
sometimes it deformed it directly so it really did need to be suitably
aligned in that case IIUC).

Given that shm_mq.c proudly documents that it avoids copying the data
on the receiving side (unless it has to reconstruct a message that was
split up), and given that it promises that the pointed-to data remains
valid until your next call, it seems that it should be safe to return
a pointer to the same HeapTupleData object every time (perhaps a
member of the TupleQueueReader struct) and just adjust its t_data and
t_len members every time, so that the gather node emits tuples
directly from the shared memory queue (and then of course tell the
slot not to pfree()).  Alternatively, if the value returned by
shm_mq_receive() is not really suitably aligned, then the comment is a
bit misleading.
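
Just to make that idea concrete, here is a minimal sketch of a reader that
keeps one reusable HeapTupleData header and points it at the bytes still owned
by the queue instead of copying them.  The struct layout and function below
are hypothetical -- the real TupleQueueReader in tqueue.c is private and
shaped differently -- and the sketch assumes each message is one whole,
suitably aligned tuple body, which is exactly the assumption in question.

/*
 * Hypothetical zero-copy reader: re-point a reusable HeapTupleData at the
 * message bytes still owned by the shm_mq, instead of copying them.
 */
#include "postgres.h"

#include "access/htup_details.h"
#include "storage/shm_mq.h"

typedef struct ZeroCopyTupleReader
{
	shm_mq_handle *queue;		/* queue attached to the worker */
	HeapTupleData tuple;		/* reusable header, re-pointed on each call */
} ZeroCopyTupleReader;

/*
 * Return the next tuple without copying it.  The result is only valid until
 * the next shm_mq_receive() call, so the caller must store it into a slot
 * with shouldFree = false and must not hold on to the pointer.
 */
static HeapTuple
zero_copy_reader_next(ZeroCopyTupleReader *reader, bool *done)
{
	shm_mq_result result;
	Size		nbytes;
	void	   *data;

	result = shm_mq_receive(reader->queue, &nbytes, &data, false);
	if (result != SHM_MQ_SUCCESS)
	{
		*done = true;			/* sender detached */
		return NULL;
	}
	*done = false;

	/* Assumes the sender wrote one whole, suitably aligned tuple body. */
	ItemPointerSetInvalid(&reader->tuple.t_self);
	reader->tuple.t_tableOid = InvalidOid;
	reader->tuple.t_len = nbytes;
	reader->tuple.t_data = (HeapTupleHeader) data;
	return &reader->tuple;
}

Whether the alignment assumption always holds (for instance for messages that
shm_mq had to reassemble) is exactly the question above.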

-- 
Thomas Munro
https://enterprisedb.com




RE: Why overhead of SPI is so large?

2019-08-21 Thread Tsunakawa, Takayuki
From: Konstantin Knizhnik [mailto:k.knizh...@postgrespro.ru]
> PL/pgSQL:   29044.361 ms
> C/SPI:  22785.597 ms
> 
> The fact that difference between PL/pgSQL and function implemented in C
> using SPI is not so large  was expected by me.

This PL/pgSQL overhead is not so significant compared with the three-fold difference,
but it makes me want some feature like Oracle's ALTER PROCEDURE ... COMPILE, which
compiles the PL/SQL logic to native code.  I've seen a few dozen percent speedup
from it.


Regards
Takayuki Tsunakawa



Re: Remove page-read callback from XLogReaderState.

2019-08-21 Thread Kyotaro Horiguchi
Thank you for the suggestion, Heikki.

At Mon, 29 Jul 2019 22:39:57 +0300, Heikki Linnakangas  wrote 
in 
> On 12/07/2019 10:10, Kyotaro Horiguchi wrote:
> >> Just FYI, to me this doesn't clearly enough look like an improvement,
> >> for a change of this size.
> > Thanks for the opinion. I kinda agree about size but it is a
> > decision between "having multiple callbacks called under the
> > hood" vs "just calling a series of functions".  I think the
> > patched XlogReadRecord is easy to use in many situations.
> > It would be better if I could completely refactor the function
> > without the syntax tricks but I think the current patch is still
> > smaller and clearer than overhauling it.
> 
> I like the idea of refactoring XLogReadRecord() to not use a callback,
> and return a XLREAD_NEED_DATA value instead. It feels like a nicer,
> easier-to-use, interface, given that all the page-read functions need
> quite a bit of state and internal logic themselves. I remember that I
> felt that that would be a nicer interface when we originally extracted
> xlogreader.c into a reusable module, but I didn't want to make such
> big changes to XLogReadRecord() at that point.
> 
> I don't much like the "continuation" style of implementing the state
> machine. Nothing wrong with such a style in principle, but we don't do
> that anywhere else, and the macros seem like overkill, and turning the

Agreed that it's kind of ugly. I could overhaul the logic to
reduce the number of state variables, but I thought that it would make the
patch hard to review.

The "continuation" style was intended to keep the impact on the main path's
shape as small as possible. For the same reason I made the variables
static instead of using an individual state struct or reducing the state
variables. (And the style was fun for me :p)

> local variables static is pretty ugly. But I think XLogReadRecord()
> could be rewritten into a more traditional state machine.
> 
> I started hacking on that, to get an idea of what it would look like
> and came up with the attached patch, to be applied on top of all your
> patches. It's still very messy, it needs quite a lot of cleanup before
> it can be committed, but I think the resulting switch-case state
> machine in XLogReadRecord() is quite straightforward at high level,
> with four states.

Sorry for the late reply. It seems less messy than I thought it could
be if I refactored it more aggressively.

> I made some further changes to the XLogReadRecord() interface:
> 
> * If you pass a valid ReadPtr (i.e. the starting point to read from)
> * argument to XLogReadRecord(), it always restarts reading from that
> * record, even if it was in the middle of reading another record
> * previously. (Perhaps it would be more convenient to provide a separate
> * function to set the starting point, and remove the RecPtr argument
> * from XLogReadRecord altogther?)

Seems reasonable. The randAccess property was replaced with
"state.PrevRecPtr = Invalid". That is easier for me to understand.

> * XLogReaderState->readBuf is now allocated and controlled by the
> * caller, not by xlogreader.c itself. When XLogReadRecord() needs data,
> * the caller makes the data available in readBuf, which can point to the
> * same buffer in all calls, or the caller may allocate a new buffer, or
> * it may point to a part of a larger buffer, whatever is convenient for
> * the caller. (Currently, all callers just allocate a BLCKSZ'd buffer,
> * though). The caller also sets readPagPtr, readLen and readPageTLI to
> * tell XLogReadRecord() what's in the buffer. So all these read* fields
> * are now set by the caller, XLogReadRecord() only reads them.

The caller knows how many bytes are to be read, so having the caller provide
the required buffer seems reasonable.
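
To illustrate the control flow being discussed, here is a rough sketch of
what a caller loop could look like when the caller owns the buffer and feeds
pages on demand.  XLREAD_NEED_DATA, the XLogReadRecord() signature used here,
and the helpers read_wal_page()/handle_record() are approximations of what
the patches in this thread do, not the committed xlogreader API.

/*
 * Control-flow sketch only; the names below are stand-ins for the
 * in-discussion interface, not the committed one.
 */
static void
consume_wal(XLogReaderState *state, XLogRecPtr start_lsn, TimeLineID tli)
{
	char	   *readBuf = palloc(XLOG_BLCKSZ);	/* buffer owned by the caller */
	XLogRecPtr	ptr = start_lsn;	/* valid pointer only to (re)start */
	XLogRecord *record;
	char	   *errormsg;

	for (;;)
	{
		int			rc = XLogReadRecord(state, ptr, &record, &errormsg);

		ptr = InvalidXLogRecPtr;	/* keep going, don't restart */

		if (rc == XLREAD_NEED_DATA)
		{
			/* the reader says which page it wants; hand it the bytes */
			int			nread = read_wal_page(state->readPagePtr, readBuf);

			state->readBuf = readBuf;
			state->readLen = nread;
			state->readPageTLI = tli;
			continue;
		}

		if (record == NULL)
			break;				/* end of WAL, or an error in errormsg */

		handle_record(state, record);	/* caller-specific processing */
	}

	pfree(readBuf);
}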

> * In your patch, if XLogReadRecord() was called with state->readLen ==
> * -1, XLogReadRecord() returned an error. That seemed a bit silly; if an
> * error happened while reading the data, why call XLogReadRecord() at
> * all? You could just report the error directly. So I removed that.

Agreed. I forgot to move the error handling to a more proper location.

> I'm not sure how intelligible this patch is in its current state. But
> I think the general idea is good. I plan to clean it up further next
> week, but feel free to work on it before that, either based on this
> patch or by starting afresh from your patch set.

I think your diff is intelligible enough for me. I'll take this over if
you haven't done so already. Anyway, I'm staring at this.

Thanks!

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Why overhead of SPI is so large?

2019-08-21 Thread Kyotaro Horiguchi
Hello.

At Wed, 21 Aug 2019 19:41:08 +0300, Konstantin Knizhnik 
 wrote in 

> Hi, hackers.
> 
> One of our customers complains about slow execution of PL/pgSQL
> functions comparing with Oracle.
> So he wants to compile PL/pgSQL functions (most likely just-in-time
> compilation).
> Certainly interpreter adds quite large overhead comparing with native
> code (~10 times) but
> most of PL/pgSQL functions are just running some SQL queues and
> iterating through results.
> 
> I can not believe that JIT can significantly speed-up such functions.
> So I decided to make simple experiment:  I created large enough table
> and implemented functions
> which calculates norm of one column in different languages.
> 
> Results are frustrating (at least for me):
> 
> PL/pgSQL:   29044.361 ms
> C/SPI:          22785.597 ms
> С/coreAPI: 2873.072 ms
> PL/Lua:        33445.520 ms
> SQL:              7397.639 ms (with parallel execution disabled)
> 
> The fact that difference between PL/pgSQL and function implemented in
> C using SPI is not so large  was expected by me.
> But why it is more than 3 time slower than correspondent SQL query?
> The answer seems to be in the third result: the same function in C
> implemented without SPI (usign table_beginscan/heap_getnext)
> Looks like SPI adds quite significant overhead.
> And as far as almost all PL languages are using SPI, them all suffer
> from it.

From looking at the attached spitest.c, it seems that the
overhead comes from the cursor operations, not from SPI itself. As far as
spitest.c goes, the cursor is useless.  "SQL" and C/coreAPI seem to
be scanning over the result of *a single* query. If that's
correct, why don't you use SPI_execute() and then scan over
SPI_tuptable?
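
For what it's worth, a minimal sketch of that pattern is below: run the query
once with SPI_execute() and walk the materialized SPI_tuptable.  The column
number (1), the use of float8 and the sum-of-squares computation are
assumptions for illustration, not taken from the actual spi_test sources.

#include "postgres.h"

#include "executor/spi.h"

/* Sketch: one SPI_execute() call, then a plain loop over SPI_tuptable. */
static double
sum_of_squares(const char *query)
{
	double		result = 0.0;
	uint64		i;

	if (SPI_connect() != SPI_OK_CONNECT)
		elog(ERROR, "SPI_connect failed");

	if (SPI_execute(query, true, 0) != SPI_OK_SELECT)
		elog(ERROR, "SPI_execute failed");

	for (i = 0; i < SPI_processed; i++)
	{
		bool		isnull;
		Datum		val = SPI_getbinval(SPI_tuptable->vals[i],
										SPI_tuptable->tupdesc,
										1, &isnull);

		if (!isnull)
		{
			double		x = DatumGetFloat8(val);

			result += x * x;
		}
	}

	SPI_finish();
	return result;
}

The trade-off is that SPI_execute() materializes the whole result set in the
tuptable, so for very large tables a cursor still bounds memory use, but it
avoids the per-fetch portal and memory-context overhead that shows up in the
profile quoted earlier in the thread.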

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center


Re: Cleanup isolation specs from unused steps

2019-08-21 Thread Michael Paquier
On Wed, Aug 21, 2019 at 11:07:19AM -0700, Melanie Plageman wrote:
> So, I think I completely misunderstood the purpose of 'dry-run'. If no
> one is using it, having a check for unused steps in dry-run may not be
> useful.

Okay.  After sleeping on it and seeing how this thread evolves, it
looks like we have more arguments in favor of just letting dry-run go to
the void.  So attached is an updated patch set:
- 0001 removes the dry-run mode from isolationtester.
- 0002 cleans up the specs of unused steps and adds the discussed
sanity checks, as proposed for this thread.
--
Michael
From d0e756bba3668af0b1f40b0b1d5bf198b83a97fd Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Thu, 22 Aug 2019 11:46:29 +0900
Subject: [PATCH v3 1/2] Remove dry-run mode from isolationtester

---
 src/test/isolation/isolationtester.c | 31 +---
 1 file changed, 1 insertion(+), 30 deletions(-)

diff --git a/src/test/isolation/isolationtester.c b/src/test/isolation/isolationtester.c
index f98bb1cf64..66ebe3b27e 100644
--- a/src/test/isolation/isolationtester.c
+++ b/src/test/isolation/isolationtester.c
@@ -31,9 +31,6 @@ static int *backend_pids = NULL;
 static const char **backend_pid_strs = NULL;
 static int	nconns = 0;
 
-/* In dry run only output permutations to be run by the tester. */
-static int	dry_run = false;
-
 static void run_testspec(TestSpec *testspec);
 static void run_all_permutations(TestSpec *testspec);
 static void run_all_permutations_recurse(TestSpec *testspec, int nsteps,
@@ -76,13 +73,10 @@ main(int argc, char **argv)
 	int			nallsteps;
 	Step	  **allsteps;
 
-	while ((opt = getopt(argc, argv, "nV")) != -1)
+	while ((opt = getopt(argc, argv, "V")) != -1)
 	{
 		switch (opt)
 		{
-			case 'n':
-dry_run = true;
-break;
 			case 'V':
 puts("isolationtester (PostgreSQL) " PG_VERSION);
 exit(0);
@@ -144,16 +138,6 @@ main(int argc, char **argv)
 		}
 	}
 
-	/*
-	 * In dry-run mode, just print the permutations that would be run, and
-	 * exit.
-	 */
-	if (dry_run)
-	{
-		run_testspec(testspec);
-		return 0;
-	}
-
 	printf("Parsed test spec with %d sessions\n", testspec->nsessions);
 
 	/*
@@ -449,19 +433,6 @@ run_permutation(TestSpec *testspec, int nsteps, Step **steps)
 	Step	  **waiting;
 	Step	  **errorstep;
 
-	/*
-	 * In dry run mode, just display the permutation in the same format used
-	 * by spec files, and return.
-	 */
-	if (dry_run)
-	{
-		printf("permutation");
-		for (i = 0; i < nsteps; i++)
-			printf(" \"%s\"", steps[i]->name);
-		printf("\n");
-		return;
-	}
-
 	waiting = pg_malloc(sizeof(Step *) * testspec->nsessions);
 	errorstep = pg_malloc(sizeof(Step *) * testspec->nsessions);
 
-- 
2.23.0

From 1ee92ce4cf359c9f6e7f8ce32e4057ed9bb0412e Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Thu, 22 Aug 2019 11:51:30 +0900
Subject: [PATCH v3 2/2] Improve detection of unused steps in isolation specs

This is useful for developers to find out if an isolation spec is
over-engineering or if it needs more work by warning at the end of a
test run if a step is not used, generating a failure with extra diffs.

While on it, clean up all the specs which include steps not used in any
permutations.
---
 src/test/isolation/isolationtester.c| 17 +
 src/test/isolation/isolationtester.h|  1 +
 src/test/isolation/specparse.y  |  1 +
 src/test/isolation/specs/freeze-the-dead.spec   |  3 ---
 .../specs/insert-conflict-do-nothing.spec   |  1 -
 .../specs/insert-conflict-do-update-2.spec  |  1 -
 .../specs/insert-conflict-do-update.spec|  1 -
 src/test/isolation/specs/sequence-ddl.spec  |  1 -
 .../specs/tuplelock-upgrade-no-deadlock.spec|  3 ---
 9 files changed, 19 insertions(+), 10 deletions(-)

diff --git a/src/test/isolation/isolationtester.c b/src/test/isolation/isolationtester.c
index 66ebe3b27e..fe0315be77 100644
--- a/src/test/isolation/isolationtester.c
+++ b/src/test/isolation/isolationtester.c
@@ -235,10 +235,23 @@ static int *piles;
 static void
 run_testspec(TestSpec *testspec)
 {
+	int		i;
+
 	if (testspec->permutations)
 		run_named_permutations(testspec);
 	else
 		run_all_permutations(testspec);
+
+	/*
+	 * Verify that all steps have been used, complaining about anything
+	 * defined but not used.
+	 */
+	for (i = 0; i < testspec->nallsteps; i++)
+	{
+		if (!testspec->allsteps[i]->used)
+			fprintf(stderr, "unused step name: %s\n",
+	testspec->allsteps[i]->name);
+	}
 }
 
 /*
@@ -438,7 +451,11 @@ run_permutation(TestSpec *testspec, int nsteps, Step **steps)
 
 	printf("\nstarting permutation:");
 	for (i = 0; i < nsteps; i++)
+	{
+		/* Track the permutation as in-use */
+		steps[i]->used = true;
 		printf(" %s", steps[i]->name);
+	}
 	printf("\n");
 
 	/* Perform setup */
diff --git a/src/test/isolation/isolationtester.h b/src/test/isolation/isolationtester.h
index 7f91e6433f..d9d2a14ecf 100644
--- a/src/test/isolation/isolationtester.h
+++ b/src/test/isolation/isolationtester.h
@@

Re: [proposal] de-TOAST'ing using a iterator

2019-08-21 Thread John Naylor
On Thu, Aug 22, 2019 at 12:10 AM Binguo Bao  wrote:
> [v9 patch]

Thanks, looks good. I'm setting it to ready for committer.

-- 
John Naylor                   https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Grouping isolationtester tests in the schedule

2019-08-21 Thread Michael Paquier
On Wed, Aug 07, 2019 at 03:17:02PM +0300, Heikki Linnakangas wrote:
> On 07/08/2019 14:42, Thomas Munro wrote:
>> I think I'd put nowait and skip locked under a separate category "FOR
>> UPDATE" or "row locking" or something, but maybe that's just me... can
>> you call that stuff DML?
> 
> Yeah, I guess SELECT FOR UPDATE isn't really DML. Separate "Row locking"
> category works for me. Or maybe "Concurrent DML and row locking". There is
> also DML in some of those tests.

Or would it make sense to group the nowait and skip-locked portion
with the multixact group, then keep the DML-specific stuff together?
There is a test called update-locked-tuple which could enter into the
"row locking" group, and the skip-locked tests have references to
multixact locks.  So I think that I would group all that into a single
group: "multixact and row locking".
--
Michael


signature.asc
Description: PGP signature


Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions

2019-08-21 Thread Michael Paquier
On Wed, Aug 21, 2019 at 12:25:22PM -0400, Stephen Frost wrote:
> That'd be the kind of thing that I was really hoping we could provide a
> common library for.

Indeed.  There could be many use cases for that.  Most of the parsing
logic is in guc-file.l.  There is little dependency on elog(), and
there is some handling of backend-side fds and their cleanup, but that
looks doable to me without too many ifdef FRONTEND.
--
Michael


signature.asc
Description: PGP signature


Re: [HACKERS] WAL logging problem in 9.4.3?

2019-08-21 Thread Noah Misch
On Wed, Aug 21, 2019 at 04:32:38PM +0900, Kyotaro Horiguchi wrote:
> At Mon, 19 Aug 2019 18:59:59 +0900 (Tokyo Standard Time), Kyotaro Horiguchi 
>  wrote in 
> <20190819.185959.118543656.horikyota@gmail.com>
> > At Sat, 17 Aug 2019 20:52:30 -0700, Noah Misch  wrote in 
> > <20190818035230.gb3021...@rfd.leadboat.com>
> > > For two-phase commit, PrepareTransaction() needs to execute pending syncs.
>  
> Now TwoPhaseFileHeader has two new members for pending syncs. It
> is useless on wal-replay, but that is needed for commit-prepared.

Syncs need to happen in PrepareTransaction(), not in commit-prepared.  I wrote
about that in https://postgr.es/m/20190820060314.ga3086...@rfd.leadboat.com




Re: POC: Cleaning up orphaned files using undo logs

2019-08-21 Thread Dilip Kumar
On Wed, Aug 21, 2019 at 9:04 PM Robert Haas  wrote:
>
> On Wed, Aug 21, 2019 at 3:55 AM Dilip Kumar  wrote:
> > I have already attempted that part and I feel it is not making code
> > any simpler than what we have today.  For packing, it's fine because I
> > can process all the member once and directly pack it into one memory
> > chunk and I can insert that to the buffer by one call of
> > InsertUndoBytes and that will make the code simpler.
>
> OK...

>
> The bigger issue here is that we don't seem to be making very much
> progress toward improving the overall design.  Several significant
> improvements have been suggested:
>
> 1. Thomas suggested quite some time ago that we should make sure that
> the transaction header is the first optional header.

I think this was already done at least 2-3 versions ago.  So now we
are updating the transaction header directly by writing at that offset,
and we don't need staging for this.
>  If we do that,
> then I'm not clear on why we even need this incremental unpacking
> stuff any more. The only reason for all of this machinery was so that
> we could find the transaction header at an unknown offset inside a
> complex record format; there is, if I understand correctly, no other
> case in which we want to incrementally decode a record.
> But if the
> transaction header is at a fixed offset, then there seems to be no
> need to even have incremental decoding at all.  Or it can be very
> simple, with three stages: (1) we don't yet have enough bytes to
> figure out how big the record is; (2) we have enough bytes to figure
> out how big the record is and we have figured that out but we don't
> yet have all of those bytes; and (3) we have the whole record, we can
> decode the whole thing and we're done.

We cannot know the complete size of the record even by reading the
header, because the payload is a variable-length part and the payload
length is stored in the payload header, which again can be at a random
offset.   But maybe we can still follow this idea, which will make
unpacking far simpler. I have a few ideas:
a) Store the payload header right after the transaction header so that we
can easily know the complete record size.
b) Once we decode the first header from uur_info, we can compute the exact
offset of the payload header and from there we can know the record
size.

>
> 2. Based on a discussion with Thomas, I suggested the GHOB stuff,
> which gets rid of the idea of transaction headers inside individual
> records altogether; instead, there is one undo blob per transaction
> (or maybe several if we overflow to another undo log) which begins
> with a sentinel byte that identifies it as owned by a transaction, and
> then the transaction header immediately follows that without being
> part of any record, and the records follow that data.  As with the
> previous idea, this gets rid of the need for incremental decoding
> because it gets rid of the need to find the transaction header inside
> of a bigger record. As Thomas put it to me off-list, it puts the
> records inside of a larger chunk of space owned by the transaction
> instead of putting the transaction header inside of some particular
> record; that seems more correct than the way we have it now.
>
> 3. Heikki suggested that the format should be much more generic and
> that more should be left up to the AM.  While neither Andres nor I are
> convinced that we should go as far in that direction as Heikki is
> proposing, the idea clearly has some merit, and we should probably be
> moving that way to some degree. For instance, the idea that we should
> store a block number and TID is a bit sketchy when you consider that a
> multi-insert operation really wants to store a TID list. The zheap
> tree has a ridiculous kludge to work around that problem; clearly we
> need something better.  We've also mentioned that, in the future, we
> might want to support TIDs that are not 6 bytes, and that even just
> looking at what's currently under development, zedstore wants to treat
> TIDs as 48-bit opaque quantities, not a 4-byte block number and a
> 2-byte item pointer offset.  So, there is clearly a need to go through
> the whole way we're doing this and rethink which parts are generic and
> which parts are AM-specific.
>
> 4. A related problem, which has been mentioned or at least alluded to
> by both Heikki and by me, is that we need a better way of handling the
> AM-specific data. Right now, the zheap code packs fixed-size things
> into the payload data and then finds them by knowing the offset where
> any particular data is within that field, but that's an unmaintainable
> mess.  The zheap code could be improved by at least defining those
> offsets as constants someplace and adding some comments explaining the
> payload formats of various undo records, but even if we do that, it's
> not going to generalize very well to anything more complicated than a
> few fixed-size bits of data.  I suggested using the pqformat stuff to
> try to struc

Re: POC: Cleaning up orphaned files using undo logs

2019-08-21 Thread Andres Freund
Hi,

On 2019-08-22 09:51:22 +0530, Dilip Kumar wrote:
> We cannot know the complete size of the record just by reading the
> header, because the payload is variable-length and the payload lengths
> are stored in the payload header, which can itself sit at an arbitrary
> offset.

Wait, but that's just purely self inflicted damage, no? The initial
length just needs to include the payload. And all this is not an issue
anymore?

Greetings,

Andres Freund




Re: POC: Cleaning up orphaned files using undo logs

2019-08-21 Thread Andres Freund
Hi,

On 2019-08-22 10:19:04 +0530, Dilip Kumar wrote:
> On Thu, Aug 22, 2019 at 9:58 AM Andres Freund  wrote:
> >
> > Hi,
> >
> > On 2019-08-22 09:51:22 +0530, Dilip Kumar wrote:
> > > We cannot know the complete size of the record just by reading the
> > > header, because the payload is variable-length and the payload lengths
> > > are stored in the payload header, which can itself sit at an arbitrary
> > > offset.
> >
> > Wait, but that's just purely self inflicted damage, no? The initial
> > length just needs to include the payload. And all this is not an issue
> > anymore?
> >
> Actually, we store the undo length only at the end of the record, and
> that is for traversing the transaction's undo record chain during a bulk
> fetch.  As such, at the beginning of the record we don't have the undo
> length.  We do have uur_info, but that only tells us which optional
> headers are included in the record.

But why? It makes a *lot* more sense to have it in the beginning. I
don't think bulk-fetch really requires it to be in the end - we can
still process records forward on a page-by-page basis.
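
For illustration, a minimal sketch of such a forward scan, assuming each
record carries enough information at its start to compute its own length
(the helper names here are invented, not the actual undo API):

/* Walk the undo records of one page front to back. */
UndoRecPtr  cur = page_first_record;    /* assumed known for the page */

while (cur < page_end)
{
    /* hypothetical helper: read the header at cur and return the full length */
    Size    len = UndoRecordLengthAt(cur);

    ProcessUndoRecord(cur, len);        /* whatever bulk fetch needs to do */
    cur += len;                         /* records are stored back to back */
}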

Greetings,

Andres Freund




Re: POC: Cleaning up orphaned files using undo logs

2019-08-21 Thread Dilip Kumar
On Thu, Aug 22, 2019 at 9:58 AM Andres Freund  wrote:
>
> Hi,
>
> On 2019-08-22 09:51:22 +0530, Dilip Kumar wrote:
> > We cannot know the complete size of the record just by reading the
> > header, because the payload is variable-length and the payload lengths
> > are stored in the payload header, which can itself sit at an arbitrary
> > offset.
>
> Wait, but that's just purely self inflicted damage, no? The initial
> length just needs to include the payload. And all this is not an issue
> anymore?
>
Actually, we store the undo length only at the end of the record, and
that is for traversing the transaction's undo record chain during a bulk
fetch.  As such, at the beginning of the record we don't have the undo
length.  We do have uur_info, but that only tells us which optional
headers are included in the record.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: POC: Cleaning up orphaned files using undo logs

2019-08-21 Thread Dilip Kumar
On Thu, Aug 22, 2019 at 10:24 AM Andres Freund  wrote:
>
> Hi,
>
> On 2019-08-22 10:19:04 +0530, Dilip Kumar wrote:
> > On Thu, Aug 22, 2019 at 9:58 AM Andres Freund  wrote:
> > >
> > > Hi,
> > >
> > > On 2019-08-22 09:51:22 +0530, Dilip Kumar wrote:
> > > > We cannot know the complete size of the record just by reading the
> > > > header, because the payload is variable-length and the payload lengths
> > > > are stored in the payload header, which can itself sit at an arbitrary
> > > > offset.
> > >
> > > Wait, but that's just purely self inflicted damage, no? The initial
> > > length just needs to include the payload. And all this is not an issue
> > > anymore?
> > >
> > Actually, we store the undo length only at the end of the record, and
> > that is for traversing the transaction's undo record chain during a bulk
> > fetch.  As such, at the beginning of the record we don't have the undo
> > length.  We do have uur_info, but that only tells us which optional
> > headers are included in the record.
>
> But why? It makes a *lot* more sense to have it in the beginning. I
> don't think bulk-fetch really requires it to be in the end - we can
> still process records forward on a page-by-page basis.

Yeah, we can handle the bulk fetch as you suggested, and that will make
it a lot easier.  But currently, while registering the undo request
(especially during the first pass), we need to compute the from_urecptr
and the to_urecptr.  For computing the from_urecptr, we do have the end
location of the transaction, because uur_next in the transaction header
tells us where our transaction ends, but we still don't know the undo
record pointer of the transaction's last record.  As of now, we read the
previous 2 bytes from the end of the transaction to get the length of
the last record, and from that we compute the undo record pointer of the
last record; that is our from_urecptr.
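
Roughly like this, with invented helper names, and ignoring alignment
and whether the stored length counts the length field itself (a sketch
of the idea only, not the patch's code):

uint16      last_len;       /* length of the transaction's last undo record */
UndoRecPtr  xact_end;       /* end of this transaction's undo, from uur_next */
UndoRecPtr  from_urecptr;   /* what we want: start of the last record */

/* hypothetical helper standing in for whatever reads bytes from the undo log */
UndoLogReadBytes((char *) &last_len, sizeof(last_len),
                 xact_end - sizeof(last_len));

from_urecptr = xact_end - sizeof(last_len) - last_len;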

Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] [PATCH] pageinspect function to decode infomasks

2019-08-21 Thread Michael Paquier
On Thu, Aug 22, 2019 at 12:36:10AM +0900, Masahiko Sawada wrote:
> I will update the patch and register to the next Commit Fest tomorrow
> if nobody is interested in.

Thanks, Sawada-san.
--
Michael


signature.asc
Description: PGP signature


Re: Optimization of vacuum for logical replication

2019-08-21 Thread Kyotaro Horiguchi
Hello.

At Wed, 21 Aug 2019 18:06:52 +0300, Konstantin Knizhnik 
 wrote in 
<968fc591-51d3-fd74-8a55-40aa770ba...@postgrespro.ru>
> Ok, you convinced me that there are cases where people want to combine
> logical replication with streaming replication without a slot.
> But is it acceptable to have a GUC variable (disabled by default) which
> allows using this optimization?

The odds are quite high. Couldn't we introduce a new wal_level
value instead?

wal_level = logical_only


I think this thread shows that logical replication is no longer a
superset(?) of physical replication.  I thought that we might be
able to change wal_level from a scalar to a bitmap, but that would break
backward compatibility.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: range_agg

2019-08-21 Thread Paul A Jungwirth
On Tue, Aug 20, 2019 at 10:33 PM Jeff Davis  wrote:
> > Is there any historical discussion around
> > typemods on range types?
>
> I did find a few references:

Thanks for looking those up! It's very interesting to see some of the
original discussion around range types.

Btw this is true of so much, isn't it?:

> It's more a property of the
> column than the type.

Sometimes I think about having a maybe type instead of null/not
null. SQL nulls are already very "monadic" I think but nullability
doesn't travel. I feel like someone ought to write a paper about that,
but I don't know of any. This is tantalizingly close (and a fun read)
but really about something else:
https://www.researchgate.net/publication/266657590_Incomplete_data_what_went_wrong_and_how_to_fix_it
Since you're getting into Rust maybe you can update the wiki page
mentioned in those threads about refactoring the type system. :-)
Anyway sorry for the digression. . . .

> So, I wouldn't spend a lot of time on typmod for multiranges.

Okay, thanks! There is plenty else to do. I think I'm already
supporting it as much as range types do.

Btw I have working multirange_{send,recv,in,out} now, and I
automatically create a multirange type and its array type when someone
creates a new range type. I have a decent start on passing tests and
no compiler warnings. I also have a start on anymultirange and
anyrangearray. (I think I need the latter to support a range-like
constructor function, so you can say `int4multirange(int4range(1,4),
int4range(8,10))`.) I want to get the any* types done and improve the
test coverage, and then I'll probably be ready to share a patch.

Here are a couple other questions:

- Does anyone have advice for the typanalyze function? I feel pretty
out of my depth there (although I haven't looked into typanalyze stuff
very deeply yet). I can probably get some inspiration from
range_typanalyze and array_typanalyze, but those are both quite
detailed (their statistics functions that is).

- What should a multirange do if you give it an empty range? I'm
thinking it should just ignore it, but then `'{}'::int4multirange =
'{empty}'::int4multirange`. Does that seem okay? (It does to me
actually, if we think of `empty` as the additive identity. Similarly
mr + empty = mr.)

- What should a multirange do if you give it a null, like
`int4multirange(int4range(1,4), null)`. I'm thinking it should be
null, just like mr + null = null. Right?

Thanks!
Paul




[PATCH] Tab completion for CREATE OR REPLACE

2019-08-21 Thread Wang, Shenhao
Hello, hackers:

I have created a patch adding tab completion for the CREATE OR REPLACE
commands in psql.  It covers:
CREATE [ OR REPLACE ] FUNCTION
CREATE [ OR REPLACE ] PROCEDURE
CREATE [ OR REPLACE ] LANGUAGE
CREATE [ OR REPLACE ] RULE name AS ON event
CREATE [ OR REPLACE ] VIEW AS SELECT
CREATE [ OR REPLACE ] AGGREGATE
CREATE [ OR REPLACE ] TRANSFORM
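
The kind of rule the patch adds looks roughly like this (a sketch in the
style of the existing Matches()/COMPLETE_WITH() rules in
src/bin/psql/tab-complete.c; see the attached patch for the real
changes):

else if (Matches("CREATE", "OR"))
    COMPLETE_WITH("REPLACE");
else if (Matches("CREATE", "OR", "REPLACE"))
    COMPLETE_WITH("FUNCTION", "PROCEDURE", "LANGUAGE", "RULE",
                  "VIEW", "AGGREGATE", "TRANSFORM");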

--
Regards
Shenhao Wang




0001-Tab-completion-of-CREATE-OR-REPLACE-in-psql.patch
Description: 0001-Tab-completion-of-CREATE-OR-REPLACE-in-psql.patch


Re: [PATCH] Tab completion for CREATE OR REPLACE

2019-08-21 Thread Ian Barwick
On Thu, 22 Aug 2019 at 15:05, Wang, Shenhao  wrote:
>
> Hello, hackers:
>
> I have created a patch adding tab completion for the CREATE OR REPLACE
> commands in psql.  It covers:
> CREATE [ OR REPLACE ] FUNCTION
> CREATE [ OR REPLACE ] PROCEDURE
> CREATE [ OR REPLACE ] LANGUAGE
> CREATE [ OR REPLACE ] RULE name AS ON event
> CREATE [ OR REPLACE ] VIEW AS SELECT
> CREATE [ OR REPLACE ] AGGREGATE
> CREATE [ OR REPLACE ] TRANSFORM
>
> --

Could you add this to the next commitfest?

https://commitfest.postgresql.org/24/

Regards

Ian Barwick

--
  Ian Barwick   https://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services