Re: Asynchronous Append on postgres_fdw nodes.
On Mon, Dec 14, 2020 at 4:01 PM Kyotaro Horiguchi wrote: > At Sat, 12 Dec 2020 18:25:57 +0900, Etsuro Fujita > wrote in > > On Fri, Nov 20, 2020 at 3:51 PM Kyotaro Horiguchi > > wrote: > > > The reason for > > > the early fetching is letting fdw send the next request as early as > > > possible. (However, I didn't measure the effect of the > > > nodeAppend-level prefetching.) > > > > I agree that that would lead to an improved efficiency in some cases, > > but I still think that that would be useless in some other cases like > > SELECT * FROM sharded_table LIMIT 1. Also, I think the situation > > would get worse if we support Append on top of joins or aggregates > > over ForeignScans, which would be more expensive to perform than these > > ForeignScans. > > I'm not sure which gain we weigh on, but if doing "LIMIT 1" on Append > for many times is more common than fetching all or "LIMIT multiples of fetch_size>", that discussion would be convincing... Is > it really the case? I don't have a clear answer for that... Performance in the case you mentioned would be improved by async execution without prefetching by Append, so it seemed reasonable to me to remove that prefetching to avoid unnecessary overheads in the case I mentioned. BUT: I started to think my proposal, which needs an additional FDW callback routine (ie, ForeignAsyncBegin()), might be a bad idea, because it would increase the burden on FDW authors. > > If we do prefetching, I think it would be better that it’s the > > responsibility of the FDW to do prefetching, and I think that that > > could be done by letting the FDW to start another data fetch, > > independently of the core, in the ForeignAsyncNotify callback routine, > > FDW does prefetching (if it means sending request to remote) in my > patch, so I agree to that. It suspect that you were intended to say > the opposite. The core (ExecAppendAsyncGetNext()) controls > prefetching in your patch. No. That function just tries to retrieve a tuple from any of the ready subplans (ie, subplans marked as as_needrequest). > > which I revived from Robert's original patch. I think that that would > > be more efficient, because the FDW would no longer need to wait until > > all buffered tuples are returned to the core. In the WIP patch, I > > I don't understand. My patch sends a prefetch-query as soon as all the > tuples of the last remote-request is stored into FDW storage. The > reason for removing ExecAsyncNotify() was it is just redundant as far > as concerning Append asynchrony. But I particulary oppose to revive > the function. Sorry, my explanation was not good, but what I'm saying here is about my patch, not your patch. I think this FDW callback routine would be useful; it allows an FDW to perform another asynchronous data fetch before delivering a tuple to the core as discussed in [1]. Also, it would be useful when extending to the case where we have intermediate nodes between an Append and a ForeignScan such as joins or aggregates, which I'll explain below. > > only allowed the callback routine to put the corresponding ForeignScan > > node into a state where it’s either ready for a new request or needing > > a callback for another data fetch, but I think we could probably relax > > the restriction so that the ForeignScan node can be put into another > > state where it’s ready for a new request while needing a callback for > > the prefetch. > > I don't understand this, too. 
ExecAsyncNotify() doesn't touch any of > the bitmaps, as_needrequest, callback_pending nor as_asyncpending in > your patch. Am I looking into something wrong? I'm looking > async-wip-2020-11-17.patch. In the WIP patch I post, these bitmaps are modified in the core side based on the callback_pending and request_complete flags in AsyncRequests returned from FDWs (See ExecAppendAsyncEventWait()). > (By the way, it is one of those that make the code hard to read to me > that the "callback" means "calling an API function". I think none of > them (ExecAsyncBegin, ExecAsyncRequest, ExecAsyncNotify) are a > "callback".) I thought the word “callback” was OK, because these functions would call the corresponding FDW callback routines, but I’ll revise the wording. > > The reason why I disabled async execution when executing EPQ is to > > avoid sending asynchronous queries to the remote sides, which would be > > useless, because scan tuples for an EPQ recheck are obtained in a > > dedicated way. > > If EPQ is performed onto Append, I think it should gain from > asynchronous execution since it is going to fetch *a* tuple from > several partitions or children. I believe EPQ doesn't contain Append > in major cases, though. (Or I didn't come up with the steps for the > case to happen...) Sorry, I don’t understand this part. Could you elaborate a bit more on it? > > What do you mean by "push-up style executor"? > > The reverse of the volcano-style executor, which enters from the > topmost node and down to the
Re: Asynchronous Append on postgres_fdw nodes.
On Mon, Dec 14, 2020 at 5:56 PM Kyotaro Horiguchi wrote: > At Sat, 12 Dec 2020 19:06:51 +0900, Etsuro Fujita > wrote in > > On Fri, Nov 20, 2020 at 8:16 PM Kyotaro Horiguchi > > wrote: > > > + /* wait or poll async events */ > > > + if (!bms_is_empty(node->as_asyncpending)) > > > + { > > > + Assert(!node->as_syncdone); > > > + Assert(bms_is_empty(node->as_needrequest)); > > > + ExecAppendAsyncEventWait(node); > > > > > > You moved the function to wait for events from execAsync to > > > nodeAppend. The former is a generic module that can be used from any > > > kind of executor nodes, but the latter is specialized for nodeAppend. > > > In other words, the abstraction level is lowered here. What is the > > > reason for the change? > > > > The reason is just because that function is only called from > > ExecAppend(). I put some functions only called from nodeAppend.c in > > execAsync.c, though. > > (I think) You told me that you preferred the genericity of the > original interface, but you're doing the opposite. If you think we > can move such a generic feature to a part of Append node, all other > features can be move the same way. I guess there's a reason you want > only the this specific feature out of all of them be Append-spcific > and I want to know that. The reason is that I’m thinking to add a small feature for multiplexing Append subplans, not a general feature for async execution as discussed in [1], because this would be an interim solution until the executor rewrite is done. > > I think that when an async-aware node gets a tuple from an > > async-capable node, they should use ExecAsyncRequest() / > > ExecAyncHogeResponse() rather than ExecProcNode() [1]. I modified the > > patch so that ExecAppendAsyncResponse() is called from Append, but to > > support bubbling up the plan tree discussed in [2], I think it should > > be called from ForeignScans (the sides of async-capable nodes). Am I > > right? Anyway, I’ll rename ExecAppendAyncResponse() to the one > > proposed in Robert’s original patch. > > Even though I understand the concept but to make work it we need to > remember the parent *async* node somewhere. In my faint memory the > very early patch did something like that. > > So I think just providing ExecAsyncResponse() doesn't make it > true. But if we make it true, it would be something like > partially-reversed steps from what the current Exec*()s do for some of > the existing nodes and further code is required for some other nodes > like WindowFunction. Bubbling up works only in very simple cases where > a returned tuple is thrown up to further parent as-is or at least when > the node convers a tuple into another shape. If an async-receiver node > wants to process multiple tuples from a child or from multiple > children, it is no longer be just a bubbling up. I explained the meaning of “bubbling up the plan tree” in a previous email I sent a moment ago. > And.. I think the reason I feel uneasy for the patch may be that the > patch uses the interface names in somewhat different context. > Origianlly the fraemework resides in-between executor nodes, not on a > node of either side. ExecAsyncNotify() notifies the requestee about an > event and ExecAsyncResonse() notifies the requestor about a new > tuple. I don't feel strangeness in this usage. But this patch feels to > me using the same names in different (and somewhat wrong) context. Sorry, this is a WIP patch. Will fix. > > > + /* Perform the actual callback. */ > > > + ExecAsyncNotify(areq); > > > > > > Mmm. 
The usage of the function (or its name) looks completely reverse > > > to me. > > As mentioned in a previous email, this is an FDW callback routine > > revived from Robert’s patch. I think the naming is reasonable, > > because the callback routine notifies the FDW of readiness of a file > > descriptor. And actually, the callback routine tells the core whether > > the corresponding ForeignScan node is ready for a new request or not, > > by setting the callback_pending flag accordingly. > > Hmm. Agreed. The word "callback" is also used there [3]... I > remember and it seems reasonable that the core calls AsyncNotify() on > FDW and the FDW calls ExecForeignScan as a response to it and notify > back to core of that using ExecAsyncRequestDone(). But the patch here > feels a little strange, or uneasy, to me. I’m not sure what I should do to improve the patch. Could you elaborate a bit more on this part? > > > postgres_fdw.c > > > > > > > postgresIterateForeignScan(ForeignScanState *node) > > > > { > > > > PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state; > > > > TupleTableSlot *slot = node->ss.ss_ScanTupleSlot; > > > > > > > > /* > > > >* If this is the first call after Begin or ReScan, we need to > > > > create the > > > >
Re: Proposed patch for key managment
Hi Stephen On Fri, 18 Dec 2020 at 21:36, Stephen Frost wrote: > > Greetings Alastair, > > * Alastair Turner (min...@decodable.me) wrote: > > On Wed, 16 Dec 2020 at 22:43, Stephen Frost wrote: ... > > passphrase key wrapper, the secret store and the cloud/HW KMS. > > > > Since the examples expand the purpose of cluster_passphrase_command, > > let's call it cluster_key_challenge_command in the examples. > > These ideas don't seem too bad but I'd think we'd pass which key we want > on the command-line using a %i or something like that, rather than using > stdin, unless there's some reason that would be an issue..? Certainly > the CLI utilities I've seen tend to expect the name of the secret that > you're asking for to be passed on the command line. Agreed. I was working on the assumption that the calling process (initdb or pg_ctl) would have access to the challenge material and be passing it to the utility, so putting it in a command line would allow it to leak. If the utility is accessing the challenge material directly, then just an indicator of which key to work with would be a lot simpler, true. > > > Starting with the passphrase key wrapper, since it's what's in place now. > > > > - cluster_key_challenge_command = 'password_key_wrapper %q' > > I do tend to like this idea of having what > cluster_key_challenge_command, or whatever we call it, expects is an > actual key and have the command that is run be a separate command that > takes the passphrase and runs the KDF (key derivation function) on it, > when a passphrase is what the user wishes to use. > > That generally makes more sense to me than having the key derivation > effort built into the backend which I have a hard time seeing any > particular reason for, as long we're already calling out to some > external utility of some kind to get the key. > ... > > With the thought of trying to keep it a reasonably simple interface, I > had a thought along these lines: > > - Separate command that runs the KDF, this is simple enough as it's > really just a hash, and it returns the key on stdout. > - initdb time option that says if we're going to have PG manage the > sub-keys, or not. > - cluster_key_command defined as "a command that is passed the ID of > the key, or keys, required for the cluster to start" > > initdb --pg-subkeys > - Calls cluster_key_command once with "$PG_SYSTEM_ID-main" or similar > and expects the main key to be provided on stdout. Everything else > is then more-or-less as is today: PG generates DEK sub-keys for data > and WAL and then encrypts them with the main key and stores them. > > As long as the enveloped keys file exists on the filesystem, when PG > starts, it'll call the cluster_key_command and will expect the 'main' > key to be provided and it'll then decrypt and verify the DEK sub-keys, > very similar to today. > > In this scenario, cluster_key_command might be set to call a command > which accepts a passphrase and runs a KDF on it, or it might be set to > call out to an external vaulting system or to a Yubikey, etc. > > initdb --no-pg-subkeys > - Calls cluster_key_command for each of the sub-keys that "pg-subkeys" > would normally generate itself, passing "$PG_SYSTEM_ID-keyid" for > each (eg: $PG_SYSTEM_ID-data, $PG_SYSTEM_ID-wal), and getting back > the keys on stdout to use. > > When PG starts, it sees that the enveloped keys file doesn't exist and > does the same as initdb did- calls cluster_key_command multiple times to > get the keys which are needed. 
We'd want to make sure to have a way > early on that checks that the data DEK provided actually decrypts the > cluster, and bail out otherwise, before actually writing any data with > that key. I'll note though that this approach would actually allow you > to change the WAL key, if you wanted to, though, which could certainly > be nice (naturally there would be various caveats about doing so and > that replicas would have to also be switched to the new key, etc, but > that all seems reasonably solvable). Having a stand-alone utility that > could do that for the --pg-subkeys case would be useful too (and just > generally decrypt it for viewing/backup/replacement/etc). > > Down the line this might even allow us to do an online re-keying, at > least once the challenges around enabling online data checksums are > sorted out, but I don't think that's something to worry about today. > Still, it seems like this potentially provides a pretty clean interface > for that to happen eventually. > Changing the WAL key independently of the local storage key is the big operational advantage I see of managing the two keys separately. It lays the basis for a full key rotation story. Rotating WAL keys during switchover between a pair with the new key applicable from a certain timeline number maye be close enough to online to satisfy most requirements. Not something to worry about today, as you say. > > Right, I think we can agree on this aspect an
Re: Proposed patch for key managment
Hi Bruce On Sat, 19 Dec 2020 at 02:38, Bruce Momjian wrote: > > I am not going be as kind. Our workflow is: > > Desirability -> Design -> Implement -> Test -> Review -> Commit > https://wiki.postgresql.org/wiki/Todo#Development_Process > > I have already asked about the first item, and received a reply talking > about the design --- that is not helpful. I only have so much patience > for the "I want my secret decoder ring to glow in the dark" type of > feature additions to this already complex feature. Unless we stay on > Desirability, I will not be replying. If you can't answer the > Desirability question, well, talking about items farther right on the > list is not very helpful. > > Now, I will say that your question about how a KMS will use this got me > thinking about how to test this, and that got me to implement the AWS > Secret Manager script, so that we definitely helpful. My point is that > I don't think it is helpful to get into long discussions unless the > Desirability is clearly answered. This is not just related to this > thread --- this kind of jump-over-Desirability has happened a lot in > relation to this feature, so I thought it would be good to clearly state > it now. > Sorry, I have waved Desirability through under the headings of ease of adoption or not raising barriers to adoption, without detailing what barriers I see or how to avoid them. I also realise that "don't scare the users" is so open-ended so as to be actively unhelpful and very quickly starts to sound like "I want my secret decoder ring to glow pink, or blue, or green when anyone asks". Given the complexity of the feature and pixels spilled in discussing it, I understand that it gets frustrating. That said, I believe there is an important case to be made here. In summary, I believe that forcing an approach for key generation and wrapping onto users of Cluster File Encryption limits the Desirability of the feature. Cluster File Encryption for Postgres would be Desirable to many users I meet if, and only if, the generation and handling of keys fits with their corporate policies. Therefore, giving a user the option to pass an encryption key to Postgres for CFE is Desirable. I realise that the option for feeding the keys in directly is an additional feature, and that it has a user experience impact if a passphrase is used. It is, however, a feature which opens up near-infinite flexibility. To stretch the analogy, it is the clip for attaching coloured or opaque coverings to the glowy bit on the secret decoder ring. The generation of keys and the wrapping of keys are contentious issues and are frequently addressed/specified in corporate security policies, standards and technical papers (NIST 800-38F is often mentioned in product docs). There are, for instance, policies in the wild which require that keys for long term use are generated from the output of a True Random Number Generator - the practical implication being that the key used to encrypt data at rest should be created by an HSM. When and why to use symmetric or asymmetric cyphers for key wrapping is another item which different organisations have different policies on - the hardware and cloud service vendors all offer both options, even if they recommend one and make it easier to use. Thanks Alastair
Re: Double partition lock in bufmgr
On 19.12.2020 10:53, Zhihong Yu wrote: Hi, w.r.t. the code in BufferAlloc(), the pointers are compared. Should we instead compare the tranche Id of the two LWLock ? Cheers As far as LWlocks are stored in the array, comparing indexes in this array (tranche Id) is equivalent to comparing element's pointers. So I do not see any problem here. Just as experiment I tried a version of BufferAlloc without double locking (patch is attached). I am not absolutely sure that my patch is correct: my main intention was to estimate influence of this buffer reassignment on performance. I just run standard pgbench for database with scale 100 and default shared buffers size (256Mb). So there are should be a lot of page replacements. I do not see any noticeable difference: vanilla: 13087.596845 patch: 13184.442130 diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c index ad0d1a9..91eb93d 100644 --- a/src/backend/storage/buffer/bufmgr.c +++ b/src/backend/storage/buffer/bufmgr.c @@ -70,6 +70,8 @@ #define RELS_BSEARCH_THRESHOLD 20 +#define NO_DOUBLE_BUFFER_LOCK 1 + typedef struct PrivateRefCountEntry { Buffer buffer; @@ -197,6 +199,7 @@ static PrivateRefCountEntry *NewPrivateRefCountEntry(Buffer buffer); static PrivateRefCountEntry *GetPrivateRefCountEntry(Buffer buffer, bool do_move); static inline int32 GetPrivateRefCount(Buffer buffer); static void ForgetPrivateRefCountEntry(PrivateRefCountEntry *ref); +static void InvalidateBuffer(BufferDesc *buf); /* * Ensure that the PrivateRefCountArray has sufficient space to store one more @@ -1093,12 +1096,22 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum, Assert(BUF_STATE_GET_REFCOUNT(buf_state) == 0); +#if NO_DOUBLE_BUFFER_LOCK + if (buf_state & BM_TAG_VALID) + { + InvalidateBuffer(buf); /* releases spinlock */ + continue; + } +#endif /* Must copy buffer flags while we still hold the spinlock */ oldFlags = buf_state & BUF_FLAG_MASK; /* Pin the buffer and then release the buffer spinlock */ PinBuffer_Locked(buf); +#if NO_DOUBLE_BUFFER_LOCK + Assert(!(oldFlags & (BM_DIRTY|BM_TAG_VALID))); +#else /* * If the buffer was dirty, try to write it out. There is a race * condition here, in that someone might dirty it after we released it @@ -1216,6 +1229,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum, } } else +#endif { /* if it wasn't valid, we need only the new partition */ LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
Re: multi-install PostgresNode
On 12/17/20 7:55 PM, Michael Paquier wrote: > On Thu, Dec 17, 2020 at 04:37:54PM -0500, Andrew Dunstan wrote: >> The proposed module would look something like this: >> >> [...] >> >> use parent PostgresNode; >> >> sub get_new_node >> { >> my $installpath= shift; >> my $node = PostgresNode::get_new_node(@_); >> bless $node; # re-bless into current class >> $node->{_installpath} = $installpath; >> return $node; >> } > Passing down the installpath as argument and saving it within a > PostgresNode or child class looks like the correct way of doing things > to me. This would require an extra routine to be able to get the > install path from a node as _installpath would remain internal to the > module file, right? Shouldn't it be something that ought to be > directly part of PostgresNode actually, where we could enforce the lib > and bin paths to the output of pg_config if an _installpath is not > provided by the caller? In short, I am not sure that we need an extra > module here. > >> and then for each class method in PostgresNode.pm we'd have an override >> something like: >> >> sub foo >> { >> my $node=shift; >> my $inst = $node->{_installpath}; >> local %ENV = %ENV; >> $ENV{PATH} = "$inst/bin:$ENV{PATH}"; >> $ENV{LD_LIBRARY_PATH} = "$inst/lib:$ENV{LD_LIBRARY_PATH}"; >> $node->SUPER::foo(@_); >> } >> >> There might be more elegant ways of doing this, but that's what I came >> up with. > As long as it does not become necessary to pass down _installpath to > all indidivual binary calls we have in PostgresNode or the extra > module, this gets a +1 from me. So, if I am getting that right, the > key point is the use of local %ENV here to make sure that PATH and > LD_LIBRARY_PATH are only enforced when it comes to calls within > PostgresNode.pm, right? That's an elegant solution. This is > something I have wanted for a long time for pg_upgrade to be able to > get rid of its test.sh. > >> My main question is: do we want something like this in the core code >> (presumably in src/test/perl), or is it not of sufficiently general >> interest? If it's wanted I'll submit a patch, probably for the March CF, >> but January if I manage to get my running shoes on. If not, I'll put it >> in the buildfarm code, but then any TAP tests that want it will likewise >> need to live there. > This facility gives us the possibility to clean up the test code of > pg_upgrade and move it to a TAP test, so I'd say that it is worth > having in the core code in the long-term. This turns out to be remarkably short, with the use of a little eval magic. Give the attached, this test program works just fine: #!/bin/perl use PostgresNodePath; $ENV{PG_REGRESS} = '/home/andrew/pgl/vpath.12/src/test/regress/pg_regress'; my $node = get_new_node('/home/andrew/pgl/inst.12.5711','blurfl'); print $node->info; print $node->connstr(),"\n"; $node->init(); cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com PostgresNodePath.pm Description: Perl program
Re: Proposed patch for key managment
On Sat, Dec 19, 2020 at 11:45:15AM +, Alastair Turner wrote: > Sorry, I have waved Desirability through under the headings of ease of > adoption or not raising barriers to adoption, without detailing what > barriers I see or how to avoid them. I also realise that "don't scare > the users" is so open-ended so as to be actively unhelpful and very > quickly starts to sound like "I want my secret decoder ring to glow > pink, or blue, or green when anyone asks". Given the complexity of the > feature and pixels spilled in discussing it, I understand that it gets > frustrating. That said, I believe there is an important case to be > made here. I am pleased you understood my feelings on this. Our last big discussion on this topic is here: https://www.postgresql.org/message-id/flat/CA%2Bfd4k7q5o6Nc_AaX6BcYM9yqTbC6_pnH-6nSD%3D54Zp6NBQTCQ%40mail.gmail.com and it was so unproductive that we started having closed voice calls every other Friday so we could discuss this without lots of "decoder ring" ideas that had to be explained. The result is our wiki page: https://wiki.postgresql.org/wiki/Transparent_Data_Encryption It has taken me a while to understand why this topic seems to almost uniquely gravitate toward "decoder ring" discussion. > In summary, I believe that forcing an approach for key generation and > wrapping onto users of Cluster File Encryption limits the Desirability > of the feature. > > Cluster File Encryption for Postgres would be Desirable to many users > I meet if, and only if, the generation and handling of keys fits with > their corporate policies. Therefore, giving a user the option to pass > an encryption key to Postgres for CFE is Desirable. I realise that the > option for feeding the keys in directly is an additional feature, and > that it has a user experience impact if a passphrase is used. It is, > however, a feature which opens up near-infinite flexibility. To > stretch the analogy, it is the clip for attaching coloured or opaque > coverings to the glowy bit on the secret decoder ring. > > The generation of keys and the wrapping of keys are contentious issues > and are frequently addressed/specified in corporate security policies, > standards and technical papers (NIST 800-38F is often mentioned in > product docs). There are, for instance, policies in the wild which > require that keys for long term use are generated from the output of a > True Random Number Generator - the practical implication being that > the key used to encrypt data at rest should be created by an HSM. When > and why to use symmetric or asymmetric cyphers for key wrapping is > another item which different organisations have different policies on > - the hardware and cloud service vendors all offer both options, even > if they recommend one and make it easier to use. To enable the direct injection of keys into the server, we would need a new command for this, since trying to make the passphrase command do this will lead to unnecessary complexity. The passphrase command should do one thing, and you can't have it changing its behavior based on parameters, since the parameters are for the script to process, not to change behavior of the server. If we wanted to do this, we would need a new command, and one of the parameters would be %k or something that would identify the key number we want to retrieve from the KMS. Stephen pointed out that we could still validate that key; the key would not be stored wrapped in the file system, but we could store an HMAC in the file system, and use that for validation. 
On other interesting approach would be to allow the server to call out for a KMS when it needs to generate the initial data keys that are wrapped by the passphrase; this is in contrast to calling the KMS everytime it needs the data keys. We should create code for the general use-case, which is currently passphrase-wrapped system-generated data keys, and then get feedback on what else we need. However, I should point out that the community is somewhat immune to the "my company needs this" kind of argument without logic to back it up, though sometimes I think this entire feature is in that category. The data keys are generated using this random value code: https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/port/pg_strong_random.c;hb=HEAD so someone would have to explain why generating this remotely in a KMS is superior, not just based on company policy. My final point is that we can find ways to do what you are suggesting as an addition to what we are adding now. What we need is clear justification of why these additional features are needed. Requiring the use of a true random number generator is a valid argument, but we need to figure out, once this is implemented, who really wants that, and how to implement it cleanly. -- Bruce Momjian https://momjian.us EnterpriseDB https://enterprisedb.com The usefulness of a cu
RE: libpq @windows : leaked singlethread_lock makes AppVerifier unhappy
Hi

Here is my workaround (from unit_tests.dll DLL_PROCESS_DETACH):

//3. Destroy LIBPQ! static pthread_mutex_t singlethread_lock
HMODULE hLeakedLibPQ = ::GetModuleHandleA("libpq.dll"); //libpq.dll v13.0.1.20323 (https://ftp.postgresql.org/pub/odbc/versions/msi/psqlodbc_13_00_.zip)
if (hLeakedLibPQ) {
    void **singlethread_lock_ptr = (void **)(((BYTE *)hLeakedLibPQ) +
#ifdef _WIN64
        0x484b8
#else
        0x3F26C
#endif //_WIN64
        );
    if (*singlethread_lock_ptr) {
        DeleteCriticalSection((LPCRITICAL_SECTION)(*singlethread_lock_ptr));
        typedef void(*pthread_mutex_destroy)(void *mutex);
        pthread_mutex_destroy freemtx = (pthread_mutex_destroy)::GetProcAddress(hLeakedLibPQ, "PQfreemem");
        assert(freemtx != NULL);
        if (freemtx) freemtx(*singlethread_lock_ptr);
    }
}
v10 release notes for extended stats
2017-03-24 [7b504eb28] Implement multivariate n-distinct coefficients
2017-04-05 [2686ee1b7] Collect and use multi-column dependency stats
2017-05-12 [bc085205c] Change CREATE STATISTICS syntax

The existing notes say:

|Add multi-column optimizer statistics to compute the correlation ratio and number of distinct values (Tomas Vondra, David Rowley, Álvaro Herrera)
|New commands are CREATE STATISTICS, ALTER STATISTICS, and DROP STATISTICS.
|This feature is helpful in estimating query memory usage and when combining the statistics from individual columns.

"correlation ratio" is referring to stxkind=d (dependencies), right? That's very unclear.

"helpful in estimating query memory usage": I guess it means that this allows the planner to correctly account for large vs small number of GROUP BY values, but it sounds more like it's going to help a user to estimate memory use.

"when combining the statistics from individual columns." this is referring to stxkind=d, handling correlated/redundant clauses, but it'd be hard for a user to know that.

Also, maybe it should say "combining stats from columns OF THE SAME TABLE".

So I propose:

|Allow creation of multi-column statistics objects, for computing the
|dependencies between columns and number of distinct values of combinations of columns
|(Tomas Vondra,
|David Rowley, Álvaro Herrera)
|The new commands are CREATE STATISTICS, ALTER STATISTICS, and DROP STATISTICS.
|Improved statistics allow the planner to generate better query plans with more accurate
|estimates of the row count and memory usage when grouping by multiple
|columns, and more accurate estimates of the row count if WHERE clauses apply
|to multiple columns and values of some columns are correlated with values of
|other columns.
Re: v10 release notes for extended stats
On Sat, Dec 19, 2020 at 01:39:27PM -0600, Justin Pryzby wrote: > 2017-03-24 [7b504eb28] Implement multivariate n-distinct coefficients > 2017-04-05 [2686ee1b7] Collect and use multi-column dependency stats > 2017-05-12 [bc085205c] Change CREATE STATISTICS syntax > > The existing notes say: > |Add multi-column optimizer statistics to compute the correlation ratio and > number of distinct values (Tomas Vondra, David Rowley, Álvaro Herrera) > |New commands are CREATE STATISTICS, ALTER STATISTICS, and DROP STATISTICS. > |This feature is helpful in estimating query memory usage and when combining > the statistics from individual columns. > > "correlation ratio" is referring to stxkind=d (dependencies), right ? That's > very unclear. > > "helpful in estimating query memory usage": I guess it means that this allows > the planner to correctly account for large vs small number of GROUP BY values, > but it sounds more like it's going to help a user to estimate memory use. > > "when combining the statistics from individual columns." this is referring to > stxkind=d, handling correlated/redundant clauses, but it'd be hard for a user > to know that. > > Also, maybe it should say "combining stats from columns OF THE SAME TABLE". > > So I propose: > |Allow creation of multi-column statistics objects, for computing the > |dependencies between columns and number of distinct values of combinations > of columns > |(Tomas Vondra, |David Rowley, Álvaro Herrera) > |The new commands are CREATE STATISTICS, ALTER STATISTICS, and DROP > STATISTICS. > |Improved statistics allow the planner to generate better query plans with > more accurate > |estimates of the row count and memory usage when grouping by multiple > |columns, and more accurate estimates of the row count if WHERE clauses apply > |to multiple columns and values of some columns are correlated with values of > |other columns. Uh, at the time, that was the best text we could come up with. We don't usually go back to update them unless there is a very good reason, and I am not seeing that above. -- Bruce Momjian https://momjian.us EnterpriseDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
Re: Proposed patch for key managment
On Sat, Dec 19, 2020 at 11:58:37AM -0500, Bruce Momjian wrote: > My final point is that we can find ways to do what you are suggesting as > an addition to what we are adding now. What we need is clear > justification of why these additional features are needed. Requiring > the use of a true random number generator is a valid argument, but we > need to figure out, once this is implemented, who really wants that, and > how to implement it cleanly. I added a comment to this script to explain how to generate a true random passphrase, attached. -- Bruce Momjian https://momjian.us EnterpriseDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee pass_fd.sh Description: Bourne shell script
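For readers who want to try the "true random" route discussed above without the attached script, here is a minimal standalone sketch. It draws 32 bytes from the operating system's entropy pool and hex-encodes them; the use of /dev/urandom is an assumption (POSIX-style systems), and this is only an illustration, not a replacement for pass_fd.sh or an HSM-backed generator.

/*
 * Sketch only: print a 256-bit random passphrase as hex, using the
 * OS entropy source.  /dev/urandom is assumed to exist (POSIX-ish
 * systems); a policy-compliant setup might use an HSM instead.
 */
#include <stdio.h>
#include <stdlib.h>

int
main(void)
{
    unsigned char buf[32];      /* 256 bits of entropy */
    FILE       *f = fopen("/dev/urandom", "rb");

    if (f == NULL)
    {
        perror("/dev/urandom");
        return EXIT_FAILURE;
    }
    if (fread(buf, 1, sizeof(buf), f) != sizeof(buf))
    {
        fprintf(stderr, "could not read enough random bytes\n");
        fclose(f);
        return EXIT_FAILURE;
    }
    fclose(f);

    for (size_t i = 0; i < sizeof(buf); i++)
        printf("%02x", buf[i]);
    printf("\n");
    return EXIT_SUCCESS;
}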
Re: Weird special case in jsonb_concat()
I wrote: > However, further experimentation found a case that fails: > regression=# select '3'::jsonb || '{}'::jsonb; > ERROR: invalid concatenation of jsonb objects > I wonder what is the point of this weird exception, and whether > whoever devised it can provide a concise explanation of what > they think the full behavior of "jsonb || jsonb" is. Why isn't > '[3, {}]' a reasonable result here, if the cases above are OK? Here is a proposed patch for that. It turns out that the third else-branch in IteratorConcat() already does the right thing, if we just remove its restrictive else-condition and let it handle everything except the two-objects and two-arrays cases. But it seemed to me that trying to handle both the object || array and array || object cases in that one else-branch was poorly thought out: only one line of code can actually be shared, and it took several extra lines of infrastructure to support the sharing. So I split those cases into separate else-branches. This also addresses the inadequate documentation that was the original complaint. Thoughts? Should we back-patch this? The existing behavior seems to me to be inconsistent enough to be arguably a bug, but we've not had field complaints saying "this should work". regards, tom lane diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index df29af6371..5f3c393a25 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -14715,8 +14715,12 @@ table2-mapping Concatenates two jsonb values. -Concatenating two objects generates an object with the union of their +Concatenating two arrays generates an array containing all the +elements of each input. Concatenating two objects generates an +object containing the union of their keys, taking the second object's value when there are duplicate keys. +All other cases are treated by converting a non-array input into a +single-element array, and then proceeding as for two arrays. Does not operate recursively: only the top-level array or object structure is merged. @@ -14727,6 +14731,14 @@ table2-mapping '{"a": "b"}'::jsonb || '{"c": "d"}'::jsonb {"a": "b", "c": "d"} + + +'[1, 2]'::jsonb || '3'::jsonb +[1, 2, 3] + + +'{"a": "b"}'::jsonb || '42'::jsonb +[{"a": "b"}, 42] diff --git a/src/backend/utils/adt/jsonfuncs.c b/src/backend/utils/adt/jsonfuncs.c index 7a25415078..5beade00f4 100644 --- a/src/backend/utils/adt/jsonfuncs.c +++ b/src/backend/utils/adt/jsonfuncs.c @@ -4690,11 +4690,14 @@ IteratorConcat(JsonbIterator **it1, JsonbIterator **it2, rk2 = JsonbIteratorNext(it2, &v2, false); /* - * Both elements are objects. + * JsonbIteratorNext reports raw scalars as if they were single-element + * arrays; hence we only need consider "object" and "array" cases here. */ if (rk1 == WJB_BEGIN_OBJECT && rk2 == WJB_BEGIN_OBJECT) { /* + * Both elements are objects. + * * Append all the tokens from v1 to res, except last WJB_END_OBJECT * (because res will not be finished yet). */ @@ -4703,18 +4706,18 @@ IteratorConcat(JsonbIterator **it1, JsonbIterator **it2, pushJsonbValue(state, r1, &v1); /* - * Append all the tokens from v2 to res, include last WJB_END_OBJECT - * (the concatenation will be completed). + * Append all the tokens from v2 to res, including last WJB_END_OBJECT + * (the concatenation will be completed). Any duplicate keys will + * automatically override the value from the first object. */ while ((r2 = JsonbIteratorNext(it2, &v2, true)) != WJB_DONE) res = pushJsonbValue(state, r2, r2 != WJB_END_OBJECT ? 
&v2 : NULL); } - - /* - * Both elements are arrays (either can be scalar). - */ else if (rk1 == WJB_BEGIN_ARRAY && rk2 == WJB_BEGIN_ARRAY) { + /* + * Both elements are arrays. + */ pushJsonbValue(state, rk1, NULL); while ((r1 = JsonbIteratorNext(it1, &v1, true)) != WJB_END_ARRAY) @@ -4731,46 +4734,40 @@ IteratorConcat(JsonbIterator **it1, JsonbIterator **it2, res = pushJsonbValue(state, WJB_END_ARRAY, NULL /* signal to sort */ ); } - /* have we got array || object or object || array? */ - else if (((rk1 == WJB_BEGIN_ARRAY && !(*it1)->isScalar) && rk2 == WJB_BEGIN_OBJECT) || - (rk1 == WJB_BEGIN_OBJECT && (rk2 == WJB_BEGIN_ARRAY && !(*it2)->isScalar))) + else if (rk1 == WJB_BEGIN_OBJECT) { - JsonbIterator **it_array = rk1 == WJB_BEGIN_ARRAY ? it1 : it2; - JsonbIterator **it_object = rk1 == WJB_BEGIN_OBJECT ? it1 : it2; - bool prepend = (rk1 == WJB_BEGIN_OBJECT); + /* + * We have object || array. + */ + Assert(rk2 == WJB_BEGIN_ARRAY); pushJsonbValue(state, WJB_BEGIN_ARRAY, NULL); - if (prepend) - { - pushJsonbValue(state, WJB_BEGIN_OBJECT, NULL); - while ((r1 = JsonbIteratorNext(it_object, &v1, true)) != WJB_DONE) -pushJs
Re: v10 release notes for extended stats
Bruce Momjian writes: > On Sat, Dec 19, 2020 at 01:39:27PM -0600, Justin Pryzby wrote: >> So I propose: > Uh, at the time, that was the best text we could come up with. We don't > usually go back to update them unless there is a very good reason, and I > am not seeing that above. Yeah, it's a couple years too late to be worth spending effort on improving the v10 notes, I fear. If there's text in the main documentation that could be improved, that's a different story. regards, tom lane
pg_preadv() and pg_pwritev()
Hello hackers, I want to be able to do synchronous vectored file I/O, so I made wrapper macros for preadv() and pwritev() with fallbacks for systems that don't have them. Following the precedent of the pg_pread() and pg_pwrite() macros, the "pg_" prefix reflects a subtle contract change: the fallback paths might have the side effect of changing the file position. They're non-standard system calls, but the BSDs and Linux have had them for a long time, and for other systems we can use POSIX readv()/writev() with an additional lseek(). The worst case is Windows (and maybe our favourite antique Unix build farm animal?) which has none of those things, so there is a further fallback to a loop. Windows does have ReadFileScatter() and WriteFileGather(), but those only work for overlapped (= asynchronous), unbuffered, page aligned access. They'll very likely be useful for native AIO+DIO support in the future, but don't fit the bill here. This is part of a project to consolidate and offload I/O (about which more soon), but seemed isolated enough to post separately and I guess it could be independently useful. From 4f4e26d0e2c023d62865365e3cb5ce1c89e63cb0 Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Thu, 26 Nov 2020 15:48:31 +1300 Subject: [PATCH 07/11] Add pg_preadv() and pg_pwritev(). Provide synchronous scatter/gather I/O routines. These map to preadv(), pwritev() with various fallbacks for systems that don't have them. --- configure | 30 ++-- configure.ac | 9 +++-- src/include/pg_config.h.in | 15 ++ src/include/port.h | 37 ++ src/port/Makefile | 2 ++ src/port/pread.c | 41 -- src/port/pwrite.c | 41 -- src/tools/msvc/Solution.pm | 5 + 8 files changed, 146 insertions(+), 34 deletions(-) diff --git a/configure b/configure index cc2089e596..d25b8f2900 100755 --- a/configure +++ b/configure @@ -13199,7 +13199,7 @@ $as_echo "#define HAVE_STDBOOL_H 1" >>confdefs.h fi -for ac_header in atomic.h copyfile.h execinfo.h getopt.h ifaddrs.h langinfo.h mbarrier.h poll.h sys/epoll.h sys/event.h sys/ipc.h sys/prctl.h sys/procctl.h sys/pstat.h sys/resource.h sys/select.h sys/sem.h sys/shm.h sys/sockio.h sys/tas.h sys/un.h termios.h ucred.h wctype.h +for ac_header in atomic.h copyfile.h execinfo.h getopt.h ifaddrs.h langinfo.h mbarrier.h poll.h sys/epoll.h sys/event.h sys/ipc.h sys/prctl.h sys/procctl.h sys/pstat.h sys/resource.h sys/select.h sys/sem.h sys/shm.h sys/sockio.h sys/tas.h sys/uio.h sys/un.h termios.h ucred.h wctype.h do : as_ac_Header=`$as_echo "ac_cv_header_$ac_header" | $as_tr_sh` ac_fn_c_check_header_mongrel "$LINENO" "$ac_header" "$as_ac_Header" "$ac_includes_default" @@ -15293,7 +15293,7 @@ fi LIBS_including_readline="$LIBS" LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'` -for ac_func in backtrace_symbols clock_gettime copyfile fdatasync getifaddrs getpeerucred getrlimit kqueue mbstowcs_l memset_s poll posix_fallocate ppoll pstat pthread_is_threaded_np readlink setproctitle setproctitle_fast setsid shm_open strchrnul strsignal symlink sync_file_range uselocale wcstombs_l +for ac_func in backtrace_symbols clock_gettime copyfile fdatasync getifaddrs getpeerucred getrlimit kqueue mbstowcs_l memset_s poll posix_fallocate ppoll pread preadv pstat pthread_is_threaded_np pwrite pwritev readlink readv setproctitle setproctitle_fast setsid shm_open strchrnul strsignal symlink sync_file_range uselocale wcstombs_l writev do : as_ac_var=`$as_echo "ac_cv_func_$ac_func" | $as_tr_sh` ac_fn_c_check_func "$LINENO" "$ac_func" "$as_ac_var" @@ -15970,32 +15970,6 @@ esac fi -ac_fn_c_check_func 
"$LINENO" "pread" "ac_cv_func_pread" -if test "x$ac_cv_func_pread" = xyes; then : - $as_echo "#define HAVE_PREAD 1" >>confdefs.h - -else - case " $LIBOBJS " in - *" pread.$ac_objext "* ) ;; - *) LIBOBJS="$LIBOBJS pread.$ac_objext" - ;; -esac - -fi - -ac_fn_c_check_func "$LINENO" "pwrite" "ac_cv_func_pwrite" -if test "x$ac_cv_func_pwrite" = xyes; then : - $as_echo "#define HAVE_PWRITE 1" >>confdefs.h - -else - case " $LIBOBJS " in - *" pwrite.$ac_objext "* ) ;; - *) LIBOBJS="$LIBOBJS pwrite.$ac_objext" - ;; -esac - -fi - ac_fn_c_check_func "$LINENO" "random" "ac_cv_func_random" if test "x$ac_cv_func_random" = xyes; then : $as_echo "#define HAVE_RANDOM 1" >>confdefs.h diff --git a/configure.ac b/configure.ac index c819aeab26..9c21e46ec8 100644 --- a/configure.ac +++ b/configure.ac @@ -1342,6 +1342,7 @@ AC_CHECK_HEADERS(m4_normalize([ sys/shm.h sys/sockio.h sys/tas.h + sys/uio.h sys/un.h termios.h ucred.h @@ -1671,9 +1672,14 @@ AC_CHECK_FUNCS(m4_normalize([ poll posix_fallocate ppoll + pread + preadv pstat pthread_is_threaded_np + pwrite + pwritev readlink + readv setproctitle setproctitle_fast setsid @@ -1684,6 +1690,7 @@ AC_CHE
Re: pg_preadv() and pg_pwritev()
Thomas Munro writes:
> I want to be able to do synchronous vectored file I/O, so I made
> wrapper macros for preadv() and pwritev() with fallbacks for systems
> that don't have them. Following the precedent of the pg_pread() and
> pg_pwrite() macros, the "pg_" prefix reflects a subtle contract
> change: the fallback paths might have the side effect of changing the
> file position.

In a quick look, seems OK with some nits:

1. port.h cannot assume that <limits.h> has already been included; nor do I want to fix that by including <limits.h> there. Do we really need to define a fallback value of IOV_MAX? If so, maybe the answer is to put the replacement struct iovec and IOV_MAX in some new header.

2. I'm not really that happy about loading <sys/uio.h> into every compilation we do, which would be another reason for a new specialized header that either includes <sys/uio.h> or provides fallback definitions.

3. The patch as given won't prove anything except that the code compiles. Is it worth fixing at least one code path to make use of pg_preadv and pg_pwritev, so we can make sure this code is tested before there's layers of other new code on top?

regards, tom lane
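As a side note for readers following the thread, the fallback contract described above — emulate pwritev() with an lseek() followed by writev(), accepting that the file position moves — can be sketched in a few lines. This is an illustration only, not the patch's pg_pwritev(); the function name, file name and demo data are made up.

/*
 * Illustration of the fallback idea: emulate pwritev() with
 * lseek() + writev().  Unlike a real pwritev(), this moves the
 * file position -- the "subtle contract change" noted above.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

static ssize_t
pwritev_fallback(int fd, const struct iovec *iov, int iovcnt, off_t offset)
{
    if (lseek(fd, offset, SEEK_SET) < 0)
        return -1;
    return writev(fd, iov, iovcnt);
}

int
main(void)
{
    struct iovec iov[2];
    int         fd = open("demo.txt", O_CREAT | O_WRONLY | O_TRUNC, 0644);

    iov[0].iov_base = "hello ";
    iov[0].iov_len = strlen("hello ");
    iov[1].iov_base = "vectored world\n";
    iov[1].iov_len = strlen("vectored world\n");

    if (fd < 0 || pwritev_fallback(fd, iov, 2, 0) < 0)
        perror("pwritev_fallback demo");
    if (fd >= 0)
        close(fd);
    return 0;
}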
Re: Deleting older versions in unique indexes to avoid page splits
On Wed, Dec 9, 2020 at 5:12 PM Peter Geoghegan wrote: > Most of the real changes in v11 (compared to v10) are in heapam.c. > I've completely replaced the table_compute_xid_horizon_for_tuples() > interface with a new interface that supports all existing requirements > (from index deletions that support LP_DEAD deletion), while also > supporting these new requirements (e.g. bottom-up index deletion). I came up with a microbenchmark that is designed to give some general sense of how helpful it is to include "extra" TIDs alongside LP_DEAD-in-index TIDs (when they happen to point to the same table block as the LP_DEAD-in-index TIDs, which we'll have to visit anyway). The basic idea is simple: Add custom instrumentation that summarizes each B-Tree index deletion in LOG output, and then run the regression tests. Attached in the result of this for a "make installcheck-world". If you're just looking at this thread for the first time in a while: note that what I'm about to go into isn't technically bottom-up deletion (though you will see some of that in the full log output if you look). It's a closely related optimization that only appeared in recent versions of the patch. So I'm comparing the current approach to simple deletion of LP_DEAD-marked index tuples to a new enhanced approach (that makes it a little more like bottom-up deletion, but only a little). Here is some sample output (selected almost at random from a text file consisting of 889 lines of similar output): ... exact TIDs deleted 17, LP_DEAD tuples 4, LP_DEAD-related table blocks 2 ) ... exact TIDs deleted 38, LP_DEAD tuples 28, LP_DEAD-related table blocks 1 ) ... exact TIDs deleted 39, LP_DEAD tuples 1, LP_DEAD-related table blocks 1 ) ... exact TIDs deleted 44, LP_DEAD tuples 42, LP_DEAD-related table blocks 3 ) ... exact TIDs deleted 6, LP_DEAD tuples 2, LP_DEAD-related table blocks 2 ) (The initial contents of each line were snipped here, to focus on the relevant metrics.) Here we see that the actual number of TIDs/index tuples deleted often *vastly* exceeds the number of LP_DEAD-marked tuples (which are all we would have been able to delete with the existing approach of just deleting LP_DEAD items). It's pretty rare for us to fail to at least delete a couple of extra TIDs. Clearly this technique is broadly effective, because in practice there are significant locality-related effects that we can benefit from. It doesn't really matter that it's hard to precisely describe all of these locality effects. IMV the question that really matters is whether or not the cost of trying is consistently very low (relative to the cost of our current approach to simple LP_DEAD deletion). We do need to understand that fully. It's tempting to think about this quantitatively (and it also bolsters the case for the patch), but that misses the point. The right way to think about this is as a *qualitative* thing. The presence of LP_DEAD bits gives us certain reliable information about the nature of the index tuple (that it is dead-to-all, and therefore safe to delete), but it also *suggests* quite a lot more than that. In practice bloat is usually highly correlated/concentrated, especially when we limit our consideration to workloads where bloat is noticeable in any practical sense. So we're very well advised to look for nearby deletable index tuples in passing -- especially since it's practically free to do so. (Which is what the patch does here, of course.) 
Let's compare this to an extreme variant of my patch that runs the same test, to see what changes. Consider a variant that exhaustively checks every index tuple on the page at the point of a simple LP_DEAD deletion operation, no matter what. Rather than only including those TIDs that happen to be on the same heap/table blocks (and thus are practically free to check), we include all of them. This design isn't acceptable in the real world because it does a lot of unnecessary I/O, but that shouldn't invalidate any comparison I make here. This is still a reasonable approximation of a version of the patch with magical foreknowledge of where to find dead TIDs. It's an Oracle (ahem) that we can sensibly compare to the real patch within certain constraints. The results of this comparative analysis seem to suggest something important about the general nature of what's going on. The results are: There are only 844 deletion operations total with the Oracle. While this is less than the actual patch's 889 deletion operations, you would expect a bigger improvement from using what is after all supposed to apply magic! This suggests to me that the precise intervention of the patch here (the new LP_DEAD deletion stuff) has an outsized impact. The correlations that naturally exist make this asymmetrical payoff-to-cost situation possible. Simple cheap heuristics once again go a surprisingly long way, kind of like bottom-up deletion itself. -- Peter Geoghegan delete_stats_regression_tests.txt.gz Description: application/gz
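To make the heuristic being measured here concrete for readers skimming the thread: whenever an LP_DEAD index tuple forces a visit to some table block, every other index tuple on the same leaf page whose TID points into that same block is practically free to check as well. The toy sketch below shows just that selection step; it is not the patch's heapam.c code, and the struct name and data are invented for illustration.

/*
 * Toy illustration of the "extra TIDs" selection: visit-worthy tuples
 * are the LP_DEAD ones plus any tuple sharing a table block with an
 * LP_DEAD tuple, since that block must be read anyway.
 */
#include <stdbool.h>
#include <stdio.h>

typedef struct ItemPointer { unsigned block; unsigned offset; } ItemPointer;

#define NTUPLES 8

int
main(void)
{
    /* TIDs of the index tuples on one leaf page (made-up data) */
    ItemPointer tids[NTUPLES] = {
        {10, 1}, {10, 2}, {11, 5}, {12, 3}, {12, 4}, {12, 9}, {13, 1}, {14, 2}
    };
    /* Which of them were marked LP_DEAD by earlier index scans */
    bool lp_dead[NTUPLES] = {true, false, false, true, false, false, false, false};

    for (int i = 0; i < NTUPLES; i++)
    {
        bool check = lp_dead[i];

        /* Not LP_DEAD itself, but does it share a block with one that is? */
        for (int j = 0; !check && j < NTUPLES; j++)
            if (lp_dead[j] && tids[j].block == tids[i].block)
                check = true;

        if (check)
            printf("check TID (%u,%u)%s\n", tids[i].block, tids[i].offset,
                   lp_dead[i] ? " [LP_DEAD]" : " [extra, same block]");
    }
    return 0;
}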
Re: pg_preadv() and pg_pwritev()
On Sun, Dec 20, 2020 at 12:34 PM Tom Lane wrote: > Thomas Munro writes: > > I want to be able to do synchronous vectored file I/O, so I made > > wrapper macros for preadv() and pwritev() with fallbacks for systems > > that don't have them. Following the precedent of the pg_pread() and > > pg_pwrite() macros, the "pg_" prefix reflects a subtle contract > > change: the fallback paths might have the side effect of changing the > > file position. > > In a quick look, seems OK with some nits: Thanks for looking! > 1. port.h cannot assume that has already been included; > nor do I want to fix that by including there. Do we > really need to define a fallback value of IOV_MAX? If so, > maybe the answer is to put the replacement struct iovec and > IOV_MAX in some new header. Ok, I moved all this stuff into port/pg_uio.h. > 2. I'm not really that happy about loading into > every compilation we do, which would be another reason for a > new specialized header that either includes or > provides fallback definitions. Ack. > 3. The patch as given won't prove anything except that the code > compiles. Is it worth fixing at least one code path to make > use of pg_preadv and pg_pwritev, so we can make sure this code > is tested before there's layers of other new code on top? OK, here's a patch to zero-fill fresh WAL segments with pwritev(). I'm drawing a blank on trivial candidate uses for preadv(), without infrastructure from later patches. From d7178f296642e177bc57fabe93abec395cbaac5f Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Thu, 26 Nov 2020 15:48:31 +1300 Subject: [PATCH v2 1/2] Add pg_preadv() and pg_pwritev(). Provide synchronous scatter/gather I/O routines. These map to preadv(), pwritev() with various fallbacks for systems that don't have them. Reviewed-by: Tom Lane Discussion: https://postgr.es/m/CA%2BhUKGJA%2Bu-220VONeoREBXJ9P3S94Y7J%2BkqCnTYmahvZJwM%3Dg%40mail.gmail.com --- configure | 30 ++ configure.ac | 9 +-- src/include/pg_config.h.in | 15 +++ src/include/port.h | 2 ++ src/include/port/pg_uio.h | 51 ++ src/port/Makefile | 2 ++ src/port/pread.c | 43 ++-- src/port/pwrite.c | 43 ++-- src/tools/msvc/Solution.pm | 5 9 files changed, 166 insertions(+), 34 deletions(-) create mode 100644 src/include/port/pg_uio.h diff --git a/configure b/configure index 11a4284e5b..5887c712cc 100755 --- a/configure +++ b/configure @@ -13061,7 +13061,7 @@ $as_echo "#define HAVE_STDBOOL_H 1" >>confdefs.h fi -for ac_header in atomic.h copyfile.h execinfo.h getopt.h ifaddrs.h langinfo.h mbarrier.h poll.h sys/epoll.h sys/event.h sys/ipc.h sys/prctl.h sys/procctl.h sys/pstat.h sys/resource.h sys/select.h sys/sem.h sys/shm.h sys/sockio.h sys/tas.h sys/un.h termios.h ucred.h wctype.h +for ac_header in atomic.h copyfile.h execinfo.h getopt.h ifaddrs.h langinfo.h mbarrier.h poll.h sys/epoll.h sys/event.h sys/ipc.h sys/prctl.h sys/procctl.h sys/pstat.h sys/resource.h sys/select.h sys/sem.h sys/shm.h sys/sockio.h sys/tas.h sys/uio.h sys/un.h termios.h ucred.h wctype.h do : as_ac_Header=`$as_echo "ac_cv_header_$ac_header" | $as_tr_sh` ac_fn_c_check_header_mongrel "$LINENO" "$ac_header" "$as_ac_Header" "$ac_includes_default" @@ -15155,7 +15155,7 @@ fi LIBS_including_readline="$LIBS" LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'` -for ac_func in backtrace_symbols clock_gettime copyfile fdatasync getifaddrs getpeerucred getrlimit kqueue mbstowcs_l memset_s poll posix_fallocate ppoll pstat pthread_is_threaded_np readlink setproctitle setproctitle_fast setsid shm_open strchrnul strsignal symlink 
sync_file_range uselocale wcstombs_l +for ac_func in backtrace_symbols clock_gettime copyfile fdatasync getifaddrs getpeerucred getrlimit kqueue mbstowcs_l memset_s poll posix_fallocate ppoll pread preadv pstat pthread_is_threaded_np pwrite pwritev readlink readv setproctitle setproctitle_fast setsid shm_open strchrnul strsignal symlink sync_file_range uselocale wcstombs_l writev do : as_ac_var=`$as_echo "ac_cv_func_$ac_func" | $as_tr_sh` ac_fn_c_check_func "$LINENO" "$ac_func" "$as_ac_var" @@ -15832,32 +15832,6 @@ esac fi -ac_fn_c_check_func "$LINENO" "pread" "ac_cv_func_pread" -if test "x$ac_cv_func_pread" = xyes; then : - $as_echo "#define HAVE_PREAD 1" >>confdefs.h - -else - case " $LIBOBJS " in - *" pread.$ac_objext "* ) ;; - *) LIBOBJS="$LIBOBJS pread.$ac_objext" - ;; -esac - -fi - -ac_fn_c_check_func "$LINENO" "pwrite" "ac_cv_func_pwrite" -if test "x$ac_cv_func_pwrite" = xyes; then : - $as_echo "#define HAVE_PWRITE 1" >>confdefs.h - -else - case " $LIBOBJS " in - *" pwrite.$ac_objext "* ) ;; - *) LIBOBJS="$LIBOBJS pwrite.$ac_objext" - ;; -esac - -fi - ac_fn_c_check_func "$LINENO" "random" "ac_cv_func_random" if test "x$ac_cv_func_random" = xyes; then : $as_echo
Re: [PATCH] Logical decoding of TRUNCATE
On Sat, Apr 07, 2018 at 07:40:11PM -0400, Peter Eisentraut wrote: > Committed with those changes. Since commit 039eb6e added logical replication support for TRUNCATE, logical apply of the TRUNCATE fails if it chooses a parallel index build: cat >/tmp/most_parallel.conf <
Re: Weird special case in jsonb_concat()
so 19. 12. 2020 v 21:35 odesílatel Tom Lane napsal: > I wrote: > > However, further experimentation found a case that fails: > > regression=# select '3'::jsonb || '{}'::jsonb; > > ERROR: invalid concatenation of jsonb objects > > I wonder what is the point of this weird exception, and whether > > whoever devised it can provide a concise explanation of what > > they think the full behavior of "jsonb || jsonb" is. Why isn't > > '[3, {}]' a reasonable result here, if the cases above are OK? > > Here is a proposed patch for that. It turns out that the third > else-branch in IteratorConcat() already does the right thing, if > we just remove its restrictive else-condition and let it handle > everything except the two-objects and two-arrays cases. But it > seemed to me that trying to handle both the object || array > and array || object cases in that one else-branch was poorly > thought out: only one line of code can actually be shared, and it > took several extra lines of infrastructure to support the sharing. > So I split those cases into separate else-branches. > > This also addresses the inadequate documentation that was the > original complaint. > > Thoughts? Should we back-patch this? The existing behavior > seems to me to be inconsistent enough to be arguably a bug, > but we've not had field complaints saying "this should work". > +1 Pavel > regards, tom lane > >
Re: On login trigger: take three
čt 17. 12. 2020 v 19:30 odesílatel Pavel Stehule napsal: > > > čt 17. 12. 2020 v 14:04 odesílatel Konstantin Knizhnik < > k.knizh...@postgrespro.ru> napsal: > >> >> >> On 17.12.2020 9:31, Pavel Stehule wrote: >> >> >> >> st 16. 12. 2020 v 20:38 odesílatel Pavel Stehule >> napsal: >> >>> Attached please find new versoin of the patch based on >>> on_connect_event_trigger_WITH_SUGGESTED_UPDATES.patch >>> So there is still only "disable_client_connection_trigger" GUC? because we need possibility to disable client connect triggers and there is no such need for other event types. As you suggested have added "dathaslogontriggers" flag which is set when client connection trigger is created. This flag is never cleaned (to avoid visibility issues mentioned in my previous mail). >>> >>> This is much better - I don't see any slowdown when logon trigger is not >>> defined >>> >>> I did some benchmarks and looks so starting language handler is >>> relatively expensive - it is about 25% and starting handler like event >>> trigger has about 35%. But this problem can be solved later and elsewhere. >>> >>> I prefer the inverse form of disable_connection_trigger. Almost all GUC >>> are in positive form - so enable_connection_triggger is better >>> >>> I don't think so current handling dathaslogontriggers is good for >>> production usage. The protection against race condition can be solved by >>> lock on pg_event_trigger >>> >> >> I thought about it, and probably the counter of connect triggers will be >> better there. The implementation will be simpler and more robust. >> >> >> >> >> I prefer to implement different approach: unset dathaslogontriggers flag >> in event trigger itself when no triggers are returned by >> EventTriggerCommonSetup. >> I am using double checking to prevent race condition. >> The main reason for this approach is that dropping of triggers is not >> done in event_trigger.c: it is done by generic code for all resources. >> It seems to be there is no right place where decrementing of trigger >> counters can be inserted. >> Also I prefer to keep all logon-trigger specific code in one file. >> >> Also, as you suggested, I renamed disable_connection_trigger to >> enable_connection_trigger. >> >> >> New version of the patch is attached. >> > > yes, it can work > for complete review the new field in pg_doc should be documented Regards Pavel > Regards > > Pavel > > > > >> >> >> >> -- >> Konstantin Knizhnik >> Postgres Professional: http://www.postgrespro.com >> The Russian Postgres Company >> >>
Re: pg_preadv() and pg_pwritev()
Thomas Munro writes:
> OK, here's a patch to zero-fill fresh WAL segments with pwritev().
> I'm drawing a blank on trivial candidate uses for preadv(), without
> infrastructure from later patches.

This looks OK to me. I tried it on prairiedog (has writev and pwrite but not pwritev) as well as gaur (has only writev). They seem happy.

One minor thought is that in

+ struct iovec iov[Min(IOV_MAX, 1024)]; /* cap stack space */

it seems like pretty much every use of IOV_MAX would want some similar cap. Should we centralize that idea with, say,

#define PG_IOV_MAX Min(IOV_MAX, 1024)

? Or will the plausible cap vary across uses?

regards, tom lane
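A sketch of the centralized cap being proposed, assuming the hypothetical PG_IOV_MAX name from the message above and spelling out Min() so the fragment stands alone:

#include <limits.h>             /* for IOV_MAX, where the OS provides it */

#ifndef IOV_MAX
#define IOV_MAX 16              /* illustrative fallback only */
#endif

/* Cap on-stack iovec array sizes, per the suggestion above. */
#define PG_IOV_MAX (IOV_MAX < 1024 ? IOV_MAX : 1024)

A caller would then declare struct iovec iov[PG_IOV_MAX]; instead of sizing the array with the raw, potentially very large IOV_MAX.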
Re: Weird special case in jsonb_concat()
On Sat, Dec 19, 2020, at 21:35, Tom Lane wrote:
>Here is a proposed patch for that.

I've tested the patch and "All 202 tests passed". In addition, I've tested it on a json intensive project, which passes all its own tests. I haven't studied the jsonfuncs.c code in detail, but the new code looks much cleaner, nice.

>This also addresses the inadequate documentation that was the
>original complaint.

Looks good. In addition, to the user wondering how to append a json array-value "as is", I think it would be useful to provide an example on how to do this in the documentation. I think there is a risk users will attempt much more fragile hacks to achieve this, if we don't provide guidance in the documentation.

Suggestion:

'["a", "b"]'::jsonb || '["a", "d"]'::jsonb
["a", "b", "a", "d"]
+
+'["a", "b"]'::jsonb || jsonb_build_array('["a", "d"]'::jsonb)
+["a", "b", ["a", "d"]]
+
'{"a": "b"}'::jsonb || '{"c": "d"}'::jsonb
{"a": "b", "c": "d"}

> Thoughts? Should we back-patch this? The existing behavior
> seems to me to be inconsistent enough to be arguably a bug,
> but we've not had field complaints saying "this should work".

+1 back-patch, I think it's a bug.

Best regards,

Joel