Re: Postgres 11: Table Partitioning and Primary Keys

2019-07-10 Thread Michael Paquier
On Tue, Jul 09, 2019 at 06:59:59PM -0400, Tom Lane wrote:
> Alvaro Herrera  writes:
>>    Unique constraints on partitioned tables (as well as primary keys)
>>    must constrain all the partition key columns.  This limitation exists
>>    because PostgreSQL can only enforce
>>    uniqueness in each partition individually.
> 
>> I'm not really sure about the "must constrain" verbiage.  Is that really
>> comprehensible?
> 
> I think "must include" might be better.

+1.

>> which may not be the pinnacle of clarity, but took some time to craft
>> and I think is correct.  Also it doesn't mention primary keys
>> explicitly; maybe we should patch it by adding "(as well as a primary
>> key)" right after "a unique constraint".  Thoughts?
> 
> I'd leave that alone.  I don't think the parenthetical comment about
> primary keys in your new text is adding much either.

Agreed with not bothering about this block and not adding the
parenthetical comment.
--
Michael




make -jn fails by requiring not-yet-generated include files.

2019-07-10 Thread Kyotaro Horiguchi
Sorry, the subject of the previous mail was wrong.  I'm resending it
with the correct subject.

 
I found the following behavior of make annoying (at dab81b9953).

make distclean
./configure 
make all

make -j4 clean all
relpath.c:21:10: fatal error: catalog/pg_tablespace_d.h: No such file or directory
 #include "catalog/pg_tablespace_d.h"

(-j is needed; this happens for me with -j2 as well)

Just fixing the Makefile for this one reveals the next failure.

I'm not sure that it's the right fix, but make went quiet after moving
"include" to the top of SUBDIRS in src/Makefile.
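
For illustration, the workaround amounts to something like this (the
surrounding entries are from memory and may not match the real list
exactly):

    # src/Makefile
    SUBDIRS = \
        include \
        common \
        port \
        ...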

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: PGOPTIONS="-fh" make check gets stuck since Postgres 11

2019-07-10 Thread Amit Kapila
On Wed, Jul 10, 2019 at 10:12 AM Michael Paquier  wrote:
>
> On Tue, Jul 09, 2019 at 11:54:29AM -0700, Melanie Plageman wrote:
> > It might be worth post-processing results files to ignore row ordering
> > in some cases to allow for easier comparison. Has this been proposed
> > in the past?
>
> Not that I recall.
>

It would be good if we can come up with something like that.  It will
be helpful for zheap, where in some cases we get different row
ordering due to in-place updates.  As of now, we try to add Order By
or do some extra magic to get consistent row ordering.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: Two pg_rewind patches (auto generate recovery conf and ensure clean shutdown)

2019-07-10 Thread Michael Paquier
On Tue, Jul 09, 2019 at 10:48:49PM +0800, Paul Guo wrote:
> Yes, the patches changed the Makefiles so that pg_rewind and pg_basebackup
> could use some common code, but for the Windows build, I'm not sure where
> those Windows build files are.  Does anyone know?  Thanks.

The VS scripts are located in src/tools/msvc/.  You will likely need
to tweak things like $frontend_extraincludes or variables in the same
area for this patch (please see Mkvcbuild.pm).
--
Michael




Re: Contribution to Perldoc for TestLib module in Postgres

2019-07-10 Thread Michael Paquier
On Tue, Jul 09, 2019 at 03:16:01PM +0200, Daniel Gustafsson wrote:
> The v2 patch is somewhat confused as it has Windows carriage returns rather
> than newlines, so it replaces the entire file making the diff hard to read.  
> It
> also includes a copy of TestLib and the v1 patch and has a lot of whitespace
> noise.

Nobody can provide a clear review if the files are fully rewritten,
even based on a read of the patch.  Perhaps you are working on Windows
and forgot to configure core.autocrlf with "git config".  That could
make your life easier.
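
Something like this, run once before generating patches, is what I mean
(a suggestion; "input" keeps LF-only line endings in what you commit,
while "true" would also check out CRLF):

    git config --global core.autocrlf input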

I have switched the patch as "waiting on author" for now.
--
Michael




Re: Excessive memory usage in multi-statement queries w/ partitioning

2019-07-10 Thread Amit Langote
On Tue, Jul 9, 2019 at 6:21 AM Tom Lane  wrote:
>
> Amit Langote  writes:
> > [ parse-plan-memcxt_v2.patch ]
>
> I got around to looking at this finally.

Thanks for the review.

> I'm not at all happy with
> the fact that it's added a plantree copy step to the only execution
> path through exec_simple_query.  That's a very significant overhead,
> in many use-cases, to solve something that nobody had complained
> about for a couple of decades before now.  I don't see the need for
> any added copy step anyway.  The only reason you're doing it AFAICS
> is so you can release the per-statement context a bit earlier, which
> is a completely unnecessary optimization.  Just wait to release it
> till the bottom of the loop.

Ah, that makes sense.  I've removed the copying of the plan tree and
moved the temporary context deletion to the bottom of the loop.

> Also, creating/deleting the sub-context is in itself an added overhead
> that accomplishes exactly nothing in the typical case where there's
> not multiple statements.  I thought the idea was to do that only if
> there was more than one raw parsetree (or, maybe better, do it for
> all but the last parsetree).

That makes sense too.  I've made it (creation/deletion of the child
context) conditional on whether there is more than one query to plan.
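
For reference, the loop in exec_simple_query now looks roughly like this
(a sketch of the pattern, not the exact patch):

	MemoryContext per_stmt_context = NULL;
	MemoryContext oldcontext;
	bool		use_per_stmt_context = (list_length(parsetree_list) > 1);

	foreach(parsetree_item, parsetree_list)
	{
		/* Only pay for a separate context when there are multiple statements. */
		if (use_per_stmt_context)
		{
			per_stmt_context = AllocSetContextCreate(MessageContext,
													 "per-statement context",
													 ALLOCSET_DEFAULT_SIZES);
			oldcontext = MemoryContextSwitchTo(per_stmt_context);
		}

		/* ... parse analysis, rewrite and planning allocate here ... */

		/* ... create portal and run the statement ... */

		/* Release per-statement memory at the bottom of the loop. */
		if (use_per_stmt_context)
		{
			MemoryContextSwitchTo(oldcontext);
			MemoryContextDelete(per_stmt_context);
		}
	}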

> To show that this isn't an empty concern, I did a quick pgbench
> test.  Using a single-client select-only test ("pgbench -S -T 60"
> in an -s 10 database), I got these numbers in three trials with HEAD:
>
> tps = 9593.818478 (excluding connections establishing)
> tps = 9570.189163 (excluding connections establishing)
> tps = 9596.579038 (excluding connections establishing)
>
> and these numbers after applying the patch:
>
> tps = 9411.918165 (excluding connections establishing)
> tps = 9389.279079 (excluding connections establishing)
> tps = 9409.350175 (excluding connections establishing)
>
> That's about a 2% dropoff.

With the updated patch, here are the numbers on my machine (HEAD vs patch)

HEAD:

tps = 3586.233815 (excluding connections establishing)
tps = 3569.252542 (excluding connections establishing)
tps = 3559.027733 (excluding connections establishing)

Patched:

tps = 3586.988057 (excluding connections establishing)
tps = 3585.169589 (excluding connections establishing)
tps = 3526.437968 (excluding connections establishing)

A bit noisy but not much degradation.

Attached updated patch.  Thanks again.

Regards,
Amit


parse-plan-memcxt_v3.patch
Description: Binary data


Re: PGOPTIONS="-fh" make check gets stuck since Postgres 11

2019-07-10 Thread Michael Paquier
On Wed, Jul 10, 2019 at 12:51:28PM +0530, Amit Kapila wrote:
> It would be good if we can come up with something like that.  It will
> be helpful for zheap, where in some cases we get different row
> ordering due to in-place updates.  As of now, we try to add Order By
> or do some extra magic to get consistent row ordering.

That was an issue for me as well when working with Postgres-XC when
the row ordering was not guaranteed depending on the number of nodes
(speaking of which Greenplum has the same issues, no?).  Adding ORDER
BY clauses to a set of tests may make sense, but then this may impact
the plans generated for some of them..
--
Michael




Re: FETCH FIRST clause PERCENT option

2019-07-10 Thread Kyotaro Horiguchi
Hello.

At Wed, 10 Jul 2019 15:02:57 +0900 (Tokyo Standard Time), Kyotaro Horiguchi 
 wrote in 
<20190710.150257.260806103.horikyota@gmail.com>
> It is seen by a simpler test.
> 
> create table t as select a from generate_series(0, 9) a;
> analyze t;
> explain analyze select * from t order by a desc;
>  Execution Time: 116.613 ms
> explain analyze select * from t order by a desc fetch first 1 percent rows 
> only;
>  Execution Time: 158.458 ms
> explain analyze select * from t order by a desc fetch first 100 percent rows 
> only;
>  Execution Time: 364.442 ms
> 
> I didn't look closely into this version.  Fetching from the
> tuplestore and returning all tuples costs 206 ms, which exceeds the
> cost of fetching the whole table and returning all tuples.  I don't
> believe a tuplestore that isn't spilling to disk is that much slower
> than (cached) table access.
> 
> Other than that, we could rip out the clause if it is 100%.

As a more significant point, I found that the first query above runs
about 10-18% faster on master (unpatched).

explain analyze select * from t order by a desc;
 Execution Time: 96.690 ms

But perf didn't give me useful information.

patched:

11857 11.7065  postgres qsort_ssup
9026  8.9114  postgres ApplySortComparator
6443  6.3612  [vdso] (tgid:8388 range:0x7ffed49ed000-0x7ffed49eefff)
5826  5.7520  postgres btint4fastcmp
4699  4.6393  no-vmlinux   /no-vmlinux
3451  3.4072  libc-2.17.so __memcpy_ssse3_back
3270  3.2285  postgres LogicalTapeWrite
2972  2.9343  postgres copytup_heap
2961  2.9234  postgres readtup_heap
2769  2.7338  postgres LogicalTapeRead
2457  2.4258  postgres GetMemoryChunkContext
2147  2.1197  postgres InstrStopNode
2021  1.9953  postgres heapgettup_pagemode
1583  1.5629  postgres writetup_heap
1555  1.5353  postgres tuplesort_gettuple_common
1508  1.4889  postgres AllocSetAlloc
...

master:

12932 12.0168  postgres qsort_ssup
9491  8.8193  postgres ApplySortComparator
6705  6.2305  postgres btint4fastcmp
6557  6.0930  [vdso] (tgid:6341 range:0x7ffdd0315000-0x7ffdd0316fff)
4874  4.5291  no-vmlinux   /no-vmlinux
4059  3.7717  postgres readtup_heap
3707  3.4447  libc-2.17.so __memcpy_ssse3_back
3583  3.3294  postgres LogicalTapeWrite
3382  3.1427  postgres LogicalTapeRead
3001  2.7886  postgres copytup_heap
2522  2.3435  postgres GetMemoryChunkContext
2464  2.2896  postgres heapgettup_pagemode
2115  1.9653  postgres InstrStopNode
1847  1.7163  postgres tuplesort_gettuple_common
1652  1.5351  postgres writetup_heap
1565  1.4542  postgres AllocSetAlloc

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: pg_receivewal documentation

2019-07-10 Thread Michael Paquier
On Wed, Jul 10, 2019 at 12:22:02AM +0200, Laurenz Albe wrote:
> Works for me.
> 
> Marked as "ready for committer".

Hmm.  synchronous_commit is user-settable, which means that it is
possible to enforce a value in the connection string doing the
connection.  Isn't that something we had better enforce directly in
the tool?  In this case what could be fixed is GetConnection() which
builds the connection string parameters.  One thing that we would need
to be careful about is that if the caller has provided a parameter for
"options" (which is plausible as wal_sender_timeout is user-settable
as of 12), then we need to make sure that the original value is
preserved, and that the enforced value of synchronous_commit is appended.

Or, as you say, we just adjust the documentation.  However, I would
recommend adding at least an example of a connection string that uses
"options" for the case where the server sets synchronous_commit to
"remote_apply".  Still, it seems to me that we have ways to reduce the
confusion automatically.
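
For the documentation, something like this is the kind of example I
have in mind (hypothetical; the host and slot names are placeholders,
and whether "off" is the right value to enforce is part of the
question):

    pg_receivewal -D /archive -S someslot --synchronous \
        -d "host=primary.example.com options='-c synchronous_commit=off'"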
--
Michael




Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-10 Thread Antonin Houska
Joe Conway  wrote:

> On 7/8/19 6:04 PM, Stephen Frost wrote:
> > * Bruce Momjian (br...@momjian.us) wrote:
> >> Uh, well, renaming the user was a big problem, but that is the only case
> >> I can think of.  I don't see that as an issue for block or WAL sequence
> >> numbers.  If we want to use a different nonce, we have to find a way to
> >> store it or look it up efficiently.  Considering the nonce size, I don't
> >> see how that is possible.
> > 
> > No, this also meant that, as an attacker, I *knew* the salt ahead of
> > time and therefore could build rainbow tables specifically for that
> > salt.  I could also use those *same* tables for any system where that
> > user had an account, even if they used different passwords on different
> > systems...
> > 
> > I appreciate that *some* of this might not be completely relevant for
> > the way a nonce is used in cryptography, but I'd be very surprised to
> > have a cryptographer tell me that a deterministic nonce didn't have
> > similar issues or didn't reduce the value of the nonce significantly.
> 
> I have worked side by side on projects with bona fide cryptographers and
> I can assure you that they recommended random nonces. Granted, that was
> in the early 2000s, but I don't think "modern cryptography" has changed
> that any more than "web scale" has made Postgres irrelevant in the
> intervening years.

I think that the particular threats have to be considered.

> Related links:

> https://defuse.ca/cbcmodeiv.htm
> https://www.cryptofails.com/post/70059609995/crypto-noobs-1-initialization-vectors

The first one looks more in-depth than the other one, so I focused on it:

* "Statistical Correlations between IV and Plaintext"

My understanding is that predictability of the IV (in our implementation of
full-instance encryption [1] we derive the IV from RelFileNode combined with
the block number) can reveal information about the first encryption block (16
bytes) of the page, i.e. part of the PageHeaderData structure.  I don't think
this leaks any valuable data.  And starting with the 2nd block, the IV is not
predictable because it is the ciphertext of the previous block.
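
To connect this with the NIST text quoted upthread: making such a
derived IV unpredictable ("method 1") would mean running the
(RelFileNode, block) nonce through the forward cipher under the data
key.  A sketch with OpenSSL (the nonce layout here is illustrative, not
what our patch actually does):

	#include <openssl/evp.h>
	#include <stdint.h>
	#include <string.h>

	static void
	derive_block_iv(const unsigned char key[32], uint32_t spcnode,
					uint32_t dbnode, uint32_t relnode, uint32_t blkno,
					unsigned char iv[16])
	{
		unsigned char nonce[16];
		int			len;
		EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();

		/* pack the (RelFileNode, block) identifier into one cipher block */
		memcpy(nonce, &spcnode, 4);
		memcpy(nonce + 4, &dbnode, 4);
		memcpy(nonce + 8, &relnode, 4);
		memcpy(nonce + 12, &blkno, 4);

		/* forward cipher over the nonce, under the same key as the data */
		EVP_EncryptInit_ex(ctx, EVP_aes_256_ecb(), NULL, key, NULL);
		EVP_CIPHER_CTX_set_padding(ctx, 0);
		EVP_EncryptUpdate(ctx, iv, &len, nonce, sizeof(nonce));
		EVP_CIPHER_CTX_free(ctx);
	}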

* "Chosen-Plaintext Attacks"

The question here is whether we expect the OS admin to have access to the
database.  In [1] we currently don't (the main use case is the cloud, where
the DBA has no control over the storage layer), but if that turns out to be a
requirement, I believe the CBC-ESSIV mode [2] can fix the problem.

Anyway, I'm not sure if this kind of attack can reveal more information than
something about the first block of the page (the page header), since each of
the following blocks uses ciphertext of the previous block as the IV.

* "Altering the IV Before Decryption"

I don't think this attack needs special attention - page checksums should
reveal it.


[1] https://commitfest.postgresql.org/23/2104/
[2] 
https://en.wikipedia.org/wiki/Disk_encryption_theory#Encrypted_salt-sector_initialization_vector_.28ESSIV.29

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com




Re: Excessive memory usage in multi-statement queries w/ partitioning

2019-07-10 Thread Kyotaro Horiguchi
Hi,

At Wed, 10 Jul 2019 16:35:18 +0900, Amit Langote  
wrote in 
> On Tue, Jul 9, 2019 at 6:21 AM Tom Lane  wrote:
> >
> > Amit Langote  writes:
> > > [ parse-plan-memcxt_v2.patch ]
> >
> > I got around to looking at this finally.
> 
> Thanks for the review.
> 
> > I'm not at all happy with
> > the fact that it's added a plantree copy step to the only execution
> > path through exec_simple_query.  That's a very significant overhead,
> > in many use-cases, to solve something that nobody had complained
> > about for a couple of decades before now.  I don't see the need for
> > any added copy step anyway.  The only reason you're doing it AFAICS
> > is so you can release the per-statement context a bit earlier, which
> > is a completely unnecessary optimization.  Just wait to release it
> > till the bottom of the loop.
> 
> Ah, that makes sense.  I've removed the copying of plan tree and also
> moved the temporary context deletion to the bottom of the loop.

- * Switch to appropriate context for constructing querytrees (again,
- * these must outlive the execution context).
+ * Switch to appropriate context for constructing query and plan trees
+ * (again, these must outlive the execution context).  Normally, it's
+ * MessageContext, but if there are more queries to plan, we use a
+ * temporary child context that will be reset after executing this
+ * query.  We avoid that overhead of setting up a separate context
+ * for the common case of having just a single query.

Might be stupid, but I feel uneasy about "usually it must live in
MessageContext, but not necessarily if there is a succeeding
query"..  *I* need more explanation of why it is safe to use that
short-lived context.

> > Also, creating/deleting the sub-context is in itself an added overhead
> > that accomplishes exactly nothing in the typical case where there's
> > not multiple statements.  I thought the idea was to do that only if
> > there was more than one raw parsetree (or, maybe better, do it for
> > all but the last parsetree).
> 
> That makes sense too.  I've made it (creation/deletion of the child
> context) conditional on whether there are more than one queries to
> plan.
>
> > To show that this isn't an empty concern, I did a quick pgbench
> > test.  Using a single-client select-only test ("pgbench -S -T 60"
> > in an -s 10 database), I got these numbers in three trials with HEAD:
> >
> > tps = 9593.818478 (excluding connections establishing)
> > tps = 9570.189163 (excluding connections establishing)
> > tps = 9596.579038 (excluding connections establishing)
> >
> > and these numbers after applying the patch:
> >
> > tps = 9411.918165 (excluding connections establishing)
> > tps = 9389.279079 (excluding connections establishing)
> > tps = 9409.350175 (excluding connections establishing)
> >
> > That's about a 2% dropoff.
> 
> With the updated patch, here are the numbers on my machine (HEAD vs patch)
> 
> HEAD:
> 
> tps = 3586.233815 (excluding connections establishing)
> tps = 3569.252542 (excluding connections establishing)
> tps = 3559.027733 (excluding connections establishing)
> 
> Patched:
> 
> tps = 3586.988057 (excluding connections establishing)
> tps = 3585.169589 (excluding connections establishing)
> tps = 3526.437968 (excluding connections establishing)
> 
> A bit noisy but not much degradation.
> 
> Attached updated patch.  Thanks again.
> 
> Regards,
> Amit

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-10 Thread Antonin Houska
Tomas Vondra  wrote:

> On Tue, Jul 09, 2019 at 03:50:39PM -0400, Bruce Momjian wrote:
> >On Tue, Jul 9, 2019 at 02:09:38PM -0400, Joe Conway wrote:
> >> On 7/9/19 11:11 AM, Bruce Momjian wrote:
> >> > Good point about nonce and IV.  I wonder if running the nonce
> >> > through the cipher with the key makes it random enough to use as an
> >> > IV.
> >>
> >> Based on that NIST document it seems so.
> >>
> >> The trick will be to be 100% sure we never reuse a nonce that is used
> >> to produce the IV when using the same key.
> >>
> >> I think the potential to get that wrong (i.e. inadvertently reuse a
> >> nonce) would lead to using the second described method
> >>
> >>   "The second method is to generate a random data block using a
> >>   FIPS-approved random number generator."
> >>
> >> That method is what I am used to seeing. But with the second method
> >> we need to store the IV, with the first we could reproduce it if we
> >> select our initial nonce carefully.
> >>
> >> So thinking out loud, and perhaps you already said this Bruce, but I
> >> guess the input nonce used to generate the IV could be something like
> >> pg_class.oid and blocknum concatenated together with some delimiting
> >> character as long as we guarantee that we generate different keys in
> >> different databases. Then there would be no need to store the IV since
> >> we could reproduce it.
> >
> >Uh, yes, and no.  Yes, we can use the pg_class.oid (since it has to
> >be preserved by pg_upgrade anyway), and the page number.  However,
> >different databases can have the same pg_class.oid/page number
> >combination, so there would be duplication between databases.  Now, you
> >might say let's add the pg_database.oid, but unfortunately, because of
> >the way we file-system-copy files from one database to another during
> >database creation (it doesn't go through shared buffers), we can't use
> >pg_database.oid as part of the nonce.
> >
> >My only idea here is that we actually decrypt/re-encrypt pages as we
> >copy them at the file system level during database creation to match the
> >new pg_database.oid.  This would allow pg_database.oid in the nonce/IV.
> >(I think we will need to modify pg_upgrade to preserve pg_database.oid.)
> >
> >If the nonce/IV is 96 bits, then that is 12 bytes or 3 4-byte values.
> >pg_class.oid is 4 bytes, pg_database.oid is 4 bytes, and that leaves
> >4-bytes for the block number, which gets us to 32TB before the page
> >counter would overflow a 4-byte value, and our max table size is 32TB
> >anyway, so that all works.
> >
> 
> I don't think that works, because that'd mean we're encrypting the same
> page with the same nonce over and over, which means reusing the IV
> (even if you hash/encrypt it). Or did I miss something?

I found out that it's wrong to use the same key (or (key, IV) pair) to
encrypt different plaintexts [1]; however, this is about a *stream cipher*.
There should be some evidence that a *block cipher* has a similar weakness
before we accept another restriction on the IV setup.

[1] https://en.wikipedia.org/wiki/Stream_cipher_attacks#Reused_key_attack

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com




Re: pg_log_fatal vs pg_log_error

2019-07-10 Thread Antonin Houska
Peter Eisentraut  wrote:

> On 2019-06-17 14:19, Antonin Houska wrote:
> > Can anyone please give me a hint (and possibly add some comments to the 
> > code)
> > when pg_log_fatal() should be used in frontend code and when it's 
> > appropriate
> > to call pg_log_error()? The current use does not seem very consistent.
> 
> For a program that runs in a loop, like for example psql or
> pg_receivewal, use error if the program keeps running and fatal if not.
> For one-shot programs like for example createdb, there is no difference,
> so we have used error in those cases.

That makes sense, but shouldn't pg_log_fatal() then perform
exit(EXIT_FAILURE) internally, just like elog(FATAL) does on the
backend side?

Actually, there are indications that such behaviour would be
appreciated even in frontends.

In pg_rewind.h I see:

/* logging support */
#define pg_fatal(...) do { pg_log_fatal(__VA_ARGS__); exit(1); } while(0)

or this in pg_upgrade/util.c:

void
pg_fatal(const char *fmt,...)
{
	va_list		args;

	va_start(args, fmt);
	pg_log_v(PG_FATAL, fmt, args);
	va_end(args);
	printf(_("Failure, exiting\n"));
	exit(1);
}
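
If pg_log_fatal() itself were to exit, the frontend-common version could
be as simple as this (a sketch only; the function in src/common/logging.c
does not currently behave this way):

	#include "common/logging.h"	/* pg_log_generic_v, PG_LOG_FATAL */

	void
	pg_log_fatal(const char *fmt,...)
	{
		va_list		ap;

		va_start(ap, fmt);
		pg_log_generic_v(PG_LOG_FATAL, fmt, ap);
		va_end(ap);
		exit(EXIT_FAILURE);		/* mirror elog(FATAL) on the backend side */
	}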

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com




Re: Custom table AMs need to include heapam.h because of BulkInsertState

2019-07-10 Thread David Rowley
On Wed, 3 Jul 2019 at 19:35, Michael Paquier  wrote:
> This has been reverted as of f5db56f, still it seems to me that this
> was moving in the right direction.

I've pushed this again, this time with the cleanup code done in the right order.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




pg_checksums (or checksums in general) vs tableam

2019-07-10 Thread Magnus Hagander
How is this intended to work?

pg_checksums enumerates the files.  What if there are files there from a
different table AM?  Isn't pg_checksums just going to fail badly then, since
it assumes everything is heap?

Also, do we allow AMs that don't support checksumming data? Do we have any
checks for tables created with such AMs in a system that has checksums
enabled?

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-10 Thread Joe Conway
On 7/9/19 7:28 PM, Stephen Frost wrote:
> Greetings,
> 
> * Joe Conway (m...@joeconway.com) wrote:
>> On 7/9/19 5:42 PM, Tomas Vondra wrote:
>> > There are two basic ways to construct nonces - CSPRNG and sequences, and
>> > then a combination of both, i.e. one part is generated from a sequence
>> > and one randomly.
>> > 
>> > FWIW not sure using OIDs as nonces directly is a good idea, as those are
>> > inherently low entropy data - how often do you see databases with OIDs
>> > above 1M or so? Probably not very often, and in most cases those are
>> > databases where those OIDs are for OIDs and large objects, so irrelevant
>> > for this purpose. I might be wrong but having a 96-bit nonce with maybe
>> > just 32bits of entrophy seems suspicious.
>> > 
>> > That does not mean we can't use the OIDs at all, but maybe hashing them
>> > into a single 4B value, and then picking the remaining 8B randomly.
>> > Also, we have a "natural" sequence in the database - LSNs, maybe that
>> > would be a good source of nonces too?
>> 
>> I think you missed the quoted part (upthread) from the NIST document:
>> 
>>   "There are two recommended methods for generating unpredictable IVs.
>>The first method is to apply the forward cipher  function, under the
>>same key that is used for the encryption of the plaintext, to a
>>nonce. The nonce must be a data block that is unique to each
>>execution of the encryption operation. For example, the nonce may be
>>a counter, as described in Appendix B, or a message number. The
>>second method is to generate a random data block using a
>>FIPS-approved random number generator."
>> 
>> That first method says a counter as input produces an acceptably
>> unpredictable IV as long as it is unique to each encryption operation.
>> If each page is going to be an "encryption operation", so as long as our
>> input nonce is unique for a given key, we should be ok. If the input
>> nonce is tableoid+pagenum and the key is different per database (at
>> least, hopefully different per tablespace too), we should be good to go,
>> at least from what I can see.
> 
> What I think Tomas is getting at here is that we don't write a page only
> once.
> 
> A nonce of tableoid+pagenum will only be unique the first time we write
> out that page.  Seems unlikely that we're only going to be writing these
> pages once though- what we need is a nonce that's unique for *every
> write* of the 8k page, isn't it?  As every write of the page is going to
> be encrypting something new.


Hmm, good point. I'm not entirely sure it would be required if the two
page versions don't exist at the same time, but I guess backups mean
that it would, so yeah.

> With sufficient randomness, we can at least be more likely to have a
> unique nonce for each 8K write.  Including the LSN seems like it'd be a
> possible alternative.

Joe

-- 
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development





Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-10 Thread Joe Conway
On 7/9/19 10:06 PM, Stephen Frost wrote:
> Greetings,
> 
> * Ryan Lambert (r...@rustprooflabs.com) wrote:
>> > What I think Tomas is getting at here is that we don't write a page only
>> > once.
>> 
>> > A nonce of tableoid+pagenum will only be unique the first time we write
>> > out that page.  Seems unlikely that we're only going to be writing these
>> > pages once though- what we need is a nonce that's unique for *every
>> > write* of the 8k page, isn't it?  As every write of the page is going to
>> >  be encrypting something new.
>> 
>> > With sufficient randomness, we can at least be more likely to have a
>> > unique nonce for each 8K write.  Including the LSN seems like it'd be a
>> > possible alternative.
>> 
>> Agreed.  I know little of the inner details about the LSN but what I read
>> in [1] sounds encouraging in addition to tableoid + pagenum.
>> 
>> [1] https://www.postgresql.org/docs/current/datatype-pg-lsn.html
> 
> Yes, but it's still something that we'd have to store somewhere- the
> actual LSN of the page is going to be in the 8K block.
> 
> Unless we decide that we can pull the LSN *out* of the 8K block and
> store it unencrypted, and then store the *rest* of the block
> encrypted...  That might also allow things like backup software to work
> on these encrypted data files for page-level backups without needing
> access to the key and that'd be pretty neat.
> 
> Of course, as with anything, the more data you expose, the higher the
> overall risk that someone can figure out some meaning from it.  Still,
> if the idea was that we'd use the LSN in this way, then it'd need to be
> stored unencrypted regardless...

I don't think we are going to be able to eliminate every possible
side-channel anyway -- this seems like a good compromise to me.

Joe

-- 
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development





Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-10 Thread Joe Conway
On 7/10/19 2:40 AM, Masahiko Sawada wrote:
> On Tue, Jul 9, 2019 at 10:16 PM Joe Conway  wrote:
>>
>> On 7/9/19 8:39 AM, Ryan Lambert wrote:
>> > Hi Thomas,
>> >
>> >> CBC mode does require
>> >> random nonces, other modes may be fine with even sequences as long as
>> >> the values are not reused.
>> >
>> > I disagree that CBC mode requires random nonces, at least based on what
>> > NIST has published.  They only require that the IV (not the nonce) must
>> > be unpredictable per [1]:
>> >
>> > " For the CBC and CFB modes, the IVs must be unpredictable."
>> >
>> > The unpredictable IV can be generated from a non-random nonce including
>> > a counter:
>> >
>> > "There are two recommended methods for generating unpredictable IVs. The
>> > first method is to apply the forward cipher function, under the same key
>> > that is used for the encryption of the plaintext, to a nonce. The nonce
>> > must be a data block that is unique to each execution of the encryption
>> > operation. For example, the nonce may be a counter, as described in
>> > Appendix B, or a message number."
>> >
>> > [1] 
>> > https://nvlpubs.nist.gov/nistpubs/Legacy/SP/nistspecialpublication800-38a.pdf
>>
>>
>> The terms nonce and IV are often used more-or-less interchangeably, and
>> it is important to be clear when we are talking about an IV specifically
>> - an IV is a specific type of nonce. Nonce means "number used once".
>> i.e. unique, whereas an IV (for CBC use anyway) should be unique and
>> random but not necessarily kept secret.
> 
> FWIW, it seems that predictable IVs can sometimes be harmful. See


Yes, for CBC as I said above "IV ... should be unique and random but not
necessarily kept secret". You can argue if the word "random" should read
"unpredictable" instead, but that was the intention.

Joe

-- 
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development





Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-10 Thread Joe Conway
On 7/10/19 2:38 AM, Masahiko Sawada wrote:
> On Tue, Jul 9, 2019 at 9:01 PM Joe Conway  wrote:
>>
>> On 7/9/19 6:07 AM, Peter Eisentraut wrote:
>> > On 2019-07-08 18:09, Joe Conway wrote:
>> >> In my mind, and in practice to a
>> >> large extent, a postgres tablespace == a unique mount point.
>> >
>> > But a critical difference is that in file systems, a separate mount
>> > point has its own journal.
>>
>> While it would be ideal to have separate WAL, and even separate shared
>> buffer pools, per tablespace, I think that is too much complexity for
>> the first implementation and we could have a single separate key for all
>> WAL for now.
> 
> If we encrypt different tables with different keys I think we need to
> encrypt WAL with the same keys as we used for tables, as per
> discussion so far. And we would need to encrypt each WAL records, not
> whole WAL 8k pages.

That is not a technical requirement to be sure. We may decide we want
that from a security perspective, but that point is debatable. There
have been different goals expressed on this thread:

1. Keep user 1 from decrypting data A and user 2 from decrypting data B
2. Limit the amount of data encrypted with key Kn

We can use K1 for A, K2 for B, and K3 for WAL and achieve goal #2. As
Stephen pointed out, goal #1 would be great to have, but I am not sure
there is consensus that it is required, at least not for the initial
implementation.

Joe
-- 
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development





Re: Minimal logical decoding on standbys

2019-07-10 Thread Amit Khandekar
On Wed, 10 Jul 2019 at 08:44, Andres Freund  wrote:
>
> Hi,
>
> Thanks for the new version! Looks like we're making progress towards
> something committable here.
>
> I think it'd be good to split the patch into a few pieces. I'd maybe do
> that like:
> 1) WAL format changes (plus required other changes)
> 2) Recovery conflicts with slots
> 3) logical decoding on standby
> 4) tests

All right. Will do that in the next patch set. For now, I have quickly
done the below changes in a single patch again (attached), in order to
get early comments if any.

>
>
> > @@ -589,6 +590,7 @@ gistXLogPageReuse(Relation rel, BlockNumber blkno, 
> > TransactionId latestRemovedXi
> >*/
> >
> >   /* XLOG stuff */
> > + xlrec_reuse.onCatalogTable = 
> > RelationIsAccessibleInLogicalDecoding(rel);
> >   xlrec_reuse.node = rel->rd_node;
> >   xlrec_reuse.block = blkno;
> >   xlrec_reuse.latestRemovedXid = latestRemovedXid;
>
> Hm. I think we otherwise only ever use
> RelationIsAccessibleInLogicalDecoding() on tables, not on indexes.  And
> while I think this would mostly work for builtin catalog tables, it
> won't work for "user catalog tables" as RelationIsUsedAsCatalogTable()
> won't perform any useful checks for indexes.
>
> So I think we either need to look up the table, or pass it down.

Done. Passed down the heap rel.

>
>
> > diff --git a/src/backend/access/heap/heapam.c 
> > b/src/backend/access/heap/heapam.c
> > index d768b9b..10b7857 100644
> > --- a/src/backend/access/heap/heapam.c
> > +++ b/src/backend/access/heap/heapam.c
> > @@ -7149,12 +7149,13 @@ heap_compute_xid_horizon_for_tuples(Relation rel,
> >   * see comments for vacuum_log_cleanup_info().
> >   */
> >  XLogRecPtr
> > -log_heap_cleanup_info(RelFileNode rnode, TransactionId latestRemovedXid)
> > +log_heap_cleanup_info(Relation rel, TransactionId latestRemovedXid)
> >  {
> >   xl_heap_cleanup_info xlrec;
> >   XLogRecPtr  recptr;
> >
> > - xlrec.node = rnode;
> > + xlrec.onCatalogTable = RelationIsAccessibleInLogicalDecoding(rel);
> > + xlrec.node = rel->rd_node;
> >   xlrec.latestRemovedXid = latestRemovedXid;
> >
> >   XLogBeginInsert();
> > @@ -7190,6 +7191,7 @@ log_heap_clean(Relation reln, Buffer buffer,
> >   /* Caller should not call me on a non-WAL-logged relation */
> >   Assert(RelationNeedsWAL(reln));
> >
> > + xlrec.onCatalogTable = RelationIsAccessibleInLogicalDecoding(reln);
>
> It'd probably be a good idea to add a comment to
> RelationIsUsedAsCatalogTable() that it better never invoke anything
> performing catalog accesses. Otherwise there's quite the danger with
> recursion (some operation doing RelationIsAccessibleInLogicalDecoding(),
> that then accessing the catalog, which in turn could again need to
> perform said operation, loop).

Added comments in RelationIsUsedAsCatalogTable() as well as
RelationIsAccessibleInLogicalDecoding() :

 * RelationIsAccessibleInLogicalDecoding
 * True if we need to log enough information to have access via
 * decoding snapshot.
 * This definition should not invoke anything that performs catalog
 * access. Otherwise, e.g. logging a WAL entry for catalog relation may
 * invoke this function, which will in turn do catalog access, which may
 * in turn cause another similar WAL entry to be logged, leading to
 * infinite recursion.

> >  /* Entry in pending-list of TIDs we need to revisit */
> > @@ -502,6 +503,7 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer 
> > buffer)
> >   OffsetNumber itemnos[MaxIndexTuplesPerPage];
> >   spgxlogVacuumRedirect xlrec;
> >
> > + xlrec.onCatalogTable = 
> > get_rel_logical_catalog(index->rd_index->indrelid);
> >   xlrec.nToPlaceholder = 0;
> >   xlrec.newestRedirectXid = InvalidTransactionId;
>
> We should document that it is safe to do catalog acceses here, because
> spgist is never used to back catalogs. Otherwise there would be an a
> endless recursion danger here.

Comments added.

>
> Did you check how hard it we to just pass down the heap relation?

It does look hard.  See my comments from an earlier reply, which I have
pasted below:

> This one seems harder, but I'm not actually sure why we make it so
> hard. It seems like we just ought to add the table to IndexVacuumInfo.

This means we have to add heapRel assignment wherever we initialize
IndexVacuumInfo structure, namely in lazy_vacuum_index(),
lazy_cleanup_index(), validate_index(), analyze_rel(), and make sure
these functions have a heap rel handle. Do you think we should do this
as part of this patch ?

>
>
> >  /*
> > + * Get the wal_level from the control file.
> > + */
> > +WalLevel
> > +GetActiveWalLevel(void)
> > +{
> > + return ControlFile->wal_level;
> > +}
>
> What does "Active" mean here? I assume it's supposed to indicate that it
> could be different than what's configured in postgresql.conf, for a
> replica? If so, that should be mentioned.

Done.  Here are the new comments:
 * Get the wal_level from the control file.

Re: Comment typo in tableam.h

2019-07-10 Thread Amit Kapila
On Mon, Jul 8, 2019 at 10:21 PM Ashwin Agrawal  wrote:
>
>
> On Sat, Jul 6, 2019 at 12:05 AM Amit Kapila  wrote:
>>
>> On Tue, Jul 2, 2019 at 1:00 AM Ashwin Agrawal  wrote:
>> > Please find attached v2 of patch 1 without objectionable comment change. 
>> > v1 of patch 2 attaching here just for convenience, no modifications made 
>> > to it.
>> >
>>
>> 0001*
>>   * See table_index_fetch_tuple's comment about what the difference between
>> - * these functions is. This function is the correct to use outside of
>> - * index entry->table tuple lookups.
>> + * these functions is. This function is correct to use outside of index
>> + * entry->table tuple lookups.
>>
>> How about if we write the last line of comment as "It is correct to
>> use this function outside of index entry->table tuple lookups."?  I am
>> not an expert on this matter, but I find the way I am suggesting
>> easier to read.
>
>
> I am fine with the way you have suggested.
>

Pushed.  I have already pushed your other patch a few days back.  So,
as far as I know, we are done here.  Do let me know if anything
proposed in this thread is still pending.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-10 Thread Joe Conway
On 7/10/19 4:47 AM, Antonin Houska wrote:
> Tomas Vondra  wrote:
>> I don't think that works, because that'd mean we're encrypting the same
>> page with the same nonce over and over, which means reusing the IV
>> (even if you hash/encrypt it). Or did I miss something?
> 
> I found out that it's wrong to use the same key (or (key, IV) pair) to encrypt
> different plain texts [1], however this is about *stream cipher*. There should
> be some evidence that *block cipher* has similar weakness before we accept
> another restriction on the IV setup.
> 
> [1] https://en.wikipedia.org/wiki/Stream_cipher_attacks#Reused_key_attack

There is plenty of guidance specifying that CBC requires a unique,
unpredictable, but not necessarily secret IV.  See for example Appendix C
here:

https://nvlpubs.nist.gov/nistpubs/Legacy/SP/nistspecialpublication800-38a.pdf

Joe

-- 
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development





Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-10 Thread Joe Conway
On 7/10/19 4:24 AM, Antonin Houska wrote:
> Joe Conway  wrote:
> 
>> On 7/8/19 6:04 PM, Stephen Frost wrote:
>> > * Bruce Momjian (br...@momjian.us) wrote:
>> >> Uh, well, renaming the user was a big problem, but that is the only case
>> >> I can think of.  I don't see that as an issue for block or WAL sequence
>> >> numbers.  If we want to use a different nonce, we have to find a way to
>> >> store it or look it up efficiently.  Considering the nonce size, I don't
>> >> see how that is possible.
>> > 
>> > No, this also meant that, as an attacker, I *knew* the salt ahead of
>> > time and therefore could build rainbow tables specifically for that
>> > salt.  I could also use those *same* tables for any system where that
>> > user had an account, even if they used different passwords on different
>> > systems...
>> > 
>> > I appreciate that *some* of this might not be completely relevant for
>> > the way a nonce is used in cryptography, but I'd be very surprised to
>> > have a cryptographer tell me that a deterministic nonce didn't have
>> > similar issues or didn't reduce the value of the nonce significantly.
>> 
>> I have worked side by side on projects with bona fide cryptographers and
>> I can assure you that they recommended random nonces. Granted, that was
>> in the early 2000s, but I don't think "modern cryptography" has changed
>> that any more than "web scale" has made Postgres irrelevant in the
>> intervening years.
> 
> I think that the particular threats have to be considered.
> 
>> Related links:
> 
>> https://defuse.ca/cbcmodeiv.htm
>> https://www.cryptofails.com/post/70059609995/crypto-noobs-1-initialization-vectors
> 
> The first one looks more in-depth than the other one, so I focused on it:
> 
> * "Statistical Correlations between IV and Plaintext"
> 
> My understanding is that predictability of the IV (in our implementation of
> full-instance encryption [1] we derive the IV from RelFileNode combined with
> block number) can reveal information about the first encryption block (16
> bytes) of the page, i.e. part of the PageHeaderData structure. I don't think
> this leaks any valuable data. And starting the 2nd block, the IV is not
> predictable because it is cipher text of the previous block.
> 
> * "Chosen-Plaintext Attacks"
> 
> The question here is whether we expect the OS admin to have access to the
> database. In [1] we currently don't (cloud, where DBA has no control over the
> storage layer is the main use case), but if it appears to be the requirement,
> I believe CBC-ESSIV mode [2] can fix the problem.
> 
> Anyway, I'm not sure if this kind of attack can reveal more information than
> something about the first block of the page (the page header), since each of
> the following blocks uses ciphertext of the previous block as the IV.
> 
> * "Altering the IV Before Decryption"
> 
> I don't think this attack needs special attention - page checksums should
> reveal it.
> 
> 
> [1] https://commitfest.postgresql.org/23/2104/
> [2] 
> https://en.wikipedia.org/wiki/Disk_encryption_theory#Encrypted_salt-sector_initialization_vector_.28ESSIV.29
> 

Please see my other reply (and
https://nvlpubs.nist.gov/nistpubs/Legacy/SP/nistspecialpublication800-38a.pdf
appendix C as pointed out by Ryan downthread).

At least in my mind, I trust a published specification from the
nation-state level over random blogs or wikipedia. If we can find some
equivalent published standards that contradict NIST we should discuss
it, but for my money I would prefer to stick with the NIST recommended
method to produce the IVs.

Joe

-- 
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development





Re: pg_receivewal documentation

2019-07-10 Thread Jesper Pedersen

Hi,

On 7/9/19 6:22 PM, Laurenz Albe wrote:

Works for me.

Marked as "ready for committer".



Thank you !

Best regards,
 Jesper





Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-10 Thread Stephen Frost
Greetings,

* Joe Conway (m...@joeconway.com) wrote:
> On 7/9/19 7:28 PM, Stephen Frost wrote:
> > * Joe Conway (m...@joeconway.com) wrote:
> >> On 7/9/19 5:42 PM, Tomas Vondra wrote:
> >> > There are two basic ways to construct nonces - CSPRNG and sequences, and
> >> > then a combination of both, i.e. one part is generated from a sequence
> >> > and one randomly.
> >> > 
> >> > FWIW not sure using OIDs as nonces directly is a good idea, as those are
> >> > inherently low entropy data - how often do you see databases with OIDs
> >> > above 1M or so? Probably not very often, and in most cases those are
> >> > databases where those OIDs are for OIDs and large objects, so irrelevant
> >> > for this purpose. I might be wrong but having a 96-bit nonce with maybe
> >> > just 32bits of entrophy seems suspicious.
> >> > 
> >> > That does not mean we can't use the OIDs at all, but maybe hashing them
> >> > into a single 4B value, and then picking the remaining 8B randomly.
> >> > Also, we have a "natural" sequence in the database - LSNs, maybe that
> >> > would be a good source of nonces too?
> >> 
> >> I think you missed the quoted part (upthread) from the NIST document:
> >> 
> >>   "There are two recommended methods for generating unpredictable IVs.
> >>The first method is to apply the forward cipher  function, under the
> >>same key that is used for the encryption of the plaintext, to a
> >>nonce. The nonce must be a data block that is unique to each
> >>execution of the encryption operation. For example, the nonce may be
> >>a counter, as described in Appendix B, or a message number. The
> >>second method is to generate a random data block using a
> >>FIPS-approved random number generator."
> >> 
> >> That first method says a counter as input produces an acceptably
> >> unpredictable IV as long as it is unique to each encryption operation.
> >> If each page is going to be an "encryption operation", so as long as our
> >> input nonce is unique for a given key, we should be ok. If the input
> >> nonce is tableoid+pagenum and the key is different per database (at
> >> least, hopefully different per tablespace too), we should be good to go,
> >> at least from what I can see.
> > 
> > What I think Tomas is getting at here is that we don't write a page only
> > once.
> > 
> > A nonce of tableoid+pagenum will only be unique the first time we write
> > out that page.  Seems unlikely that we're only going to be writing these
> > pages once though- what we need is a nonce that's unique for *every
> > write* of the 8k page, isn't it?  As every write of the page is going to
> > be encrypting something new.
> 
> Hmm, good point. I'm not entirely sure it would be required if the two
> page versions don't exist at the same time, but I guess backups mean
> that it would, so yeah.

Uh, or an attacker got a copy of the page and then just waited a few
minutes for a new version to be written and then grabbed that...

Definitely not limited to just concerns about the fact that other
versions would exist in backups too.

Thanks,

Stephen




Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-10 Thread Joe Conway
On 7/10/19 2:45 AM, Masahiko Sawada wrote:
> On Wed, Jul 10, 2019 at 11:06 AM Stephen Frost  wrote:
>>
>> Greetings,
>>
>> * Ryan Lambert (r...@rustprooflabs.com) wrote:
>> > > What I think Tomas is getting at here is that we don't write a page only
>> > > once.
>> >
>> > > A nonce of tableoid+pagenum will only be unique the first time we write
>> > > out that page.  Seems unlikely that we're only going to be writing these
>> > > pages once though- what we need is a nonce that's unique for *every
>> > > write* of the 8k page, isn't it?  As every write of the page is going to
>> > >  be encrypting something new.
>> >
>> > > With sufficient randomness, we can at least be more likely to have a
>> > > unique nonce for each 8K write.  Including the LSN seems like it'd be a
>> > > possible alternative.
>> >
>> > Agreed.  I know little of the inner details about the LSN but what I read
>> > in [1] sounds encouraging in addition to tableoid + pagenum.
>> >
>> > [1] https://www.postgresql.org/docs/current/datatype-pg-lsn.html
>>
>> Yes, but it's still something that we'd have to store somewhere- the
>> actual LSN of the page is going to be in the 8K block.
> 
> Can we use CBC-ESSIV[1] or XTS[2] instead? IIUC with these modes we
> can use table oid and page number for IV or tweak and we don't need to
> change them each time to encrypt pages.
> 
> [1] 
> https://en.wikipedia.org/wiki/Disk_encryption_theory#Encrypted_salt-sector_initialization_vector_.28ESSIV.29
> [2] 
> https://en.wikipedia.org/wiki/Disk_encryption_theory#XEX-based_tweaked-codebook_mode_with_ciphertext_stealing_(XTS)


From what I can tell [1] is morally equivalent to the NIST method and
does nothing to change the fact that the input nonce needs to be unique
for each encryption operation. I have not had time to review [2] yet...

While it would be very tempting to convince ourselves that a unique
input nonce is not a requirement, I think we are better off being
conservative unless we find some extremely clear guidance that allows us
to draw that conclusion.

Joe
-- 
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development





Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-10 Thread Joe Conway
On 7/10/19 8:34 AM, Stephen Frost wrote:
> Greetings,
> 
> * Joe Conway (m...@joeconway.com) wrote:
>> On 7/9/19 7:28 PM, Stephen Frost wrote:
>> > * Joe Conway (m...@joeconway.com) wrote:
>> >> On 7/9/19 5:42 PM, Tomas Vondra wrote:
>> >> > There are two basic ways to construct nonces - CSPRNG and sequences, and
>> >> > then a combination of both, i.e. one part is generated from a sequence
>> >> > and one randomly.
>> >> > 
>> >> > FWIW not sure using OIDs as nonces directly is a good idea, as those are
>> >> > inherently low entropy data - how often do you see databases with OIDs
>> >> > above 1M or so? Probably not very often, and in most cases those are
>> >> > databases where those OIDs are for OIDs and large objects, so irrelevant
>> >> > for this purpose. I might be wrong but having a 96-bit nonce with maybe
>> >> > just 32bits of entrophy seems suspicious.
>> >> > 
>> >> > That does not mean we can't use the OIDs at all, but maybe hashing them
>> >> > into a single 4B value, and then picking the remaining 8B randomly.
>> >> > Also, we have a "natural" sequence in the database - LSNs, maybe that
>> >> > would be a good source of nonces too?
>> >> 
>> >> I think you missed the quoted part (upthread) from the NIST document:
>> >> 
>> >>   "There are two recommended methods for generating unpredictable IVs.
>> >>The first method is to apply the forward cipher  function, under the
>> >>same key that is used for the encryption of the plaintext, to a
>> >>nonce. The nonce must be a data block that is unique to each
>> >>execution of the encryption operation. For example, the nonce may be
>> >>a counter, as described in Appendix B, or a message number. The
>> >>second method is to generate a random data block using a
>> >>FIPS-approved random number generator."
>> >> 
>> >> That first method says a counter as input produces an acceptably
>> >> unpredictable IV as long as it is unique to each encryption operation.
>> >> If each page is going to be an "encryption operation", so as long as our
>> >> input nonce is unique for a given key, we should be ok. If the input
>> >> nonce is tableoid+pagenum and the key is different per database (at
>> >> least, hopefully different per tablespace too), we should be good to go,
>> >> at least from what I can see.
>> > 
>> > What I think Tomas is getting at here is that we don't write a page only
>> > once.
>> > 
>> > A nonce of tableoid+pagenum will only be unique the first time we write
>> > out that page.  Seems unlikely that we're only going to be writing these
>> > pages once though- what we need is a nonce that's unique for *every
>> > write* of the 8k page, isn't it?  As every write of the page is going to
>> > be encrypting something new.
>> 
>> Hmm, good point. I'm not entirely sure it would be required if the two
>> page versions don't exist at the same time, but I guess backups mean
>> that it would, so yeah.
> 
> Uh, or an attacker got a copy of the page and then just waited a few
> minutes for a new version to be written and then grabbed that...
> 
> Definitely not limited to just concerns about the fact that other
> versions would exist in backups too.

Agreed :-/

Joe
-- 
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development





Re: extension patch of CREATE OR REPLACE TRIGGER

2019-07-10 Thread Surafel Temesgen
Hi Takamichi Osumi,
On Tue, Jul 9, 2019

> I've rebased the previous patch to be applied
>

I haven't tested your patch fully yet, but here are some comments.
There are some whitespace issues, like here:
-  bool is_internal)
+  bool is_internal,
+  Oid existing_constraint_oid)
in a few places.

+ // trigoid = HeapTupleGetOid(tuple); // raw code
Please remove this line if you don't use it.

+ if(!existing_constraint_oid){
+ conOid = GetNewOidWithIndex(conDesc, ConstraintOidIndexId,
+ Anum_pg_constraint_oid);
+ values[Anum_pg_constraint_oid - 1] = ObjectIdGetDatum(conOid);
+ }
Incorrect bracing style here, and it appears in a few other places too.
Also, it seems to me that the change in the regression test is huge;
can you reduce it?

regards
Surafel


Re: pg_receivewal documentation

2019-07-10 Thread Jesper Pedersen

Hi,

On 7/10/19 4:04 AM, Michael Paquier wrote:

On Wed, Jul 10, 2019 at 12:22:02AM +0200, Laurenz Albe wrote:

Works for me.

Marked as "ready for committer".


Hmm.  synchronous_commit is user-settable, which means that it is
possible to enforce a value in the connection string doing the
connection.  Isn't that something we had better enforce directly in
the tool?  In this case what could be fixed is GetConnection() which
builds the connection string parameters.  One thing that we would need
to be careful about is that if the caller has provided a parameter for
"options" (which is plausible as wal_sender_timeout is user-settable
as of 12), then we need to make sure that the original value is
preserved, and that the enforced of synchronous_commit is appended.



I think that the above is out-of-scope for this patch. And ...


Or, as you say, we just adjust the documentation.  However I would
recommend adding at least an example of connection string which uses
"options" if the server sets synchronous_commit to "remote_apply" in
this case.  Still it seems to me that we have ways to reduce the
confusion automatically.



The patch tries to highlight that if you, for example, have

postgresql.conf
===
synchronous_commit = remote_apply
synchronous_standby_names = '*'

and you _only_ have pg_receivewal connected, then changes are only
applied locally on the primary instance, and any client (psql, ...) won't
get its commits acknowledged.  The replay_lsn for the pg_receivewal
connection will keep increasing, so


env PGOPTIONS="-c synchronous_commit=remote_write" pg_receivewal -D 
/tmp/wal -S replica1 --synchronous


won't help you.

We could add some wording around 'synchronous_standby_names' if it makes 
the case clearer.


Best regards,
 Jesper




Re: pg_checksums (or checksums in general) vs tableam

2019-07-10 Thread Michael Paquier
On Wed, Jul 10, 2019 at 11:42:34AM +0200, Magnus Hagander wrote:
> pg_checksums enumerate the files. What if there are files there from a
> different tableam? Isn't pg_checksums just going to badly fail then, since
> it assumes everything is heap?
> 
> Also, do we allow AMs that don't support checksumming data? Do we have any
> checks for tables created with such AMs in a system that has checksums
> enabled?

Table AMs going through shared buffers and smgr.c, like zedstore,
share the same page header, meaning that the on-disk file format is the
same as heap's, and that checksums are computed the same way as for
heap.  pg_checksums is not going to complain about those and would work
just fine.

Table AMs using their own storage layer (which would most likely use
their own checksum method normally?) would be ignored by pg_checksums
if the file names don't match what smgr uses, but it could result in
failures if they use on-disk file names which match.
--
Michael




Re: PGOPTIONS="-fh" make check gets stuck since Postgres 11

2019-07-10 Thread Melanie Plageman
On Wed, Jul 10, 2019 at 12:40 AM Michael Paquier 
wrote:

> On Wed, Jul 10, 2019 at 12:51:28PM +0530, Amit Kapila wrote:
> > It would be good if we can come up with something like that.  It will
> > be helpful for zheap, where in some cases we get different row
> > ordering due to in-place updates.  As of now, we try to add Order By
> > or do some extra magic to get consistent row ordering.
>
> That was an issue for me as well when working with Postgres-XC when
> the row ordering was not guaranteed depending on the number of nodes
> (speaking of which Greenplum has the same issues, no?).  Adding ORDER
> BY clauses to a set of tests may make sense, but then this may impact
> the plans generated for some of them..
> --
> Michael
>

We have a tool that does this. gpdiff [1] is used for results
post-processing, and it uses a perl module called atmsort [2] to deal
with the specific ORDER BY case discussed here.

[1]
https://github.com/greenplum-db/gpdb/blob/master/src/test/regress/gpdiff.pl
[2]
https://github.com/greenplum-db/gpdb/blob/master/src/test/regress/atmsort.pl

-- 
Melanie Plageman


Re: Index Skip Scan

2019-07-10 Thread Jesper Pedersen

Hi,

On 7/9/19 10:14 PM, Thomas Munro wrote:

Thomas, do you have any ideas for this? I can see that MySQL implemented the
functionality in two change sets (base and function support), but like
you said we shouldn't paint ourselves into a corner.


I think amskip() could be augmented by later patches to take a
parameter 'skip mode' that can take values SKIP_FIRST, SKIP_LAST and
SKIP_FIRST | SKIP_LAST (meaning please give me both).  What we have in
the current patch is SKIP_FIRST behaviour.  Being able to choose
SKIP_FIRST or SKIP_LAST allows you to handle i, MAX(j) GROUP BY (i)
ORDER BY i (ie forward scan for the order, but wanting the highest key
for each distinct prefix), and being able to choose both allows for i,
MIN(j), MAX(j) GROUP BY i.  Earlier I thought that this scheme that
allows you to ask for both might be incompatible with David's
suggestion of tracking UniqueKeys in paths, but now I see that it
isn't: it's just that SKIP_FIRST | SKIP_LAST would have no UniqueKeys
and therefore require a regular agg on top.  I suspect that's OK: this
min/max transformation stuff is highly specialised and can have
whatever magic path-building logic is required in
preprocess_minmax_aggregates().



Ok, great.
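
For concreteness, those three modes map roughly onto query shapes like
these (illustrative table t(i, j)):

-- SKIP_FIRST: lowest j for each distinct i
SELECT DISTINCT ON (i) i, j FROM t ORDER BY i, j;

-- SKIP_LAST: highest j for each distinct i, still ordered by i
SELECT i, max(j) FROM t GROUP BY i ORDER BY i;

-- SKIP_FIRST | SKIP_LAST: both extremes, with a regular agg on top
SELECT i, min(j), max(j) FROM t GROUP BY i;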

Thanks for your feedback !

Best regards,
 Jesper




Re: let's make the list of reportable GUCs configurable (was Re: Add %r substitution for psql prompts to show recovery status)

2019-07-10 Thread Robert Haas
On Tue, Jul 9, 2019 at 2:32 PM Dave Cramer  wrote:
> So did this die from lack of interest?
>
> I have proposed in another thread adding more GUC REPORT variables, but I see 
> this as a much better way.
>
> I'm willing to code the patch if we can get some buy in here ?

It seemed like most people at least didn't hate the idea completely,
and some liked it, so I think it would be worth revisiting.  If you
decide to write a patch, I'll try to help review.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: progress report for ANALYZE

2019-07-10 Thread Robert Haas
On Tue, Jul 9, 2019 at 6:12 PM Alvaro Herrera  wrote:
> Hmm, ok.  In CREATE INDEX, we use the block counters multiple times.

Why do we do that?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: progress report for ANALYZE

2019-07-10 Thread Alvaro Herrera
On 2019-Jul-10, Robert Haas wrote:

> On Tue, Jul 9, 2019 at 6:12 PM Alvaro Herrera  
> wrote:
> > Hmm, ok.  In CREATE INDEX, we use the block counters multiple times.
> 
> Why do we do that?

Because we scan the table first, then the index, then the table again
(last two for the validation phase of CIC).  We count "block numbers"
separately for each of those, and keep those counters in the same pair
of columns.  I think we also do that for tuple counters in one case.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: global / super barriers (for checksums)

2019-07-10 Thread Magnus Hagander
On Tue, Oct 30, 2018 at 6:16 AM Andres Freund  wrote:

> Hi,
>
> Magnus cornered me at pgconf.eu and asked me whether I could prototype
> the "barriers" I'd been talking about in the online checksumming thread.
>
> The problem there was to make sure that all processes, backends and
> auxiliary processes have seen the new state of checksums being enabled,
> and aren't currently in the process of writing a new page out.
>
> The current prototype solves that by requiring a restart, but that
> strikes me as a far too large hammer.
>
> The attached patch introduces "global barriers" (name was invented in a
> overcrowded hotel lounge, so ...), which allow to wait for such a change
> to be absorbed by all backends.
>
> I've only tested the code with gdb, but that seems to work:
>
> p WaitForGlobalBarrier(EmitGlobalBarrier(GLOBBAR_CHECKSUM))
>
> waits until all backends (including bgwriter, checkpointers, walwriters,
> bgworkers, ...) have accepted interrupts at least once.  Multiple such
> requests are coalesced.
>
> I decided to wait until interrupts are actually processed, rather than
> just the signal being received, because that means the system is in a well
> defined state. E.g. there's no pages currently being written out.
>
> For the checksum enablement patch you'd do something like;
>
> EnableChecksumsInShmemWithLock();
> WaitForGlobalBarrier(EmitGlobalBarrier(GLOBBAR_CHECKSUM));
>
> and after that you should be able to set it to a persistent mode.
>
>
> I chose to use procsignals to send the signals, a global uint64
> globalBarrierGen, and per-backend barrierGen, barrierFlags, with the
> latter keeping track which barriers have been requested. There likely
> seem to be other usecases.
>
>
> The patch definitely is in a prototype stage. At the very least it needs
> a high-level comment somewhere, and some of the lower-level code needs
> to be cleaned up.
>
> One thing I wasn't happy about is how checksum internals have to absorb
> barrier requests - that seems unavoidable, but I'd hope for something
> more global than just BufferSync().
>
>
> Comments?
>
>

Finally getting back to this one.

In re-reading this, I notice there are a lot of references to Intterrupt
(with two t). I'm guessing this is just a spelling error, and not something
that actually conveys some meaning?

Can you elaborate on what you mean with:
+   /* XXX: need a more principled approach here */

Is that the thing you refer to above about "checksum internals"?

Also in checking we figured it'd be nice to have a wait event for this,
since a process can potentially get stuck in an infinite loop waiting for
some other process if it's misbehaving. Kind of like the attached?

//Magnus
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index a84042a4ea..58b360d225 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3765,6 +3765,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
 		case WAIT_EVENT_EXECUTE_GATHER:
 			event_name = "ExecuteGather";
 			break;
+		case WAIT_EVENT_GLOBAL_BARRIER:
+			event_name = "GlobalBarrier";
+			break;
 		case WAIT_EVENT_HASH_BATCH_ALLOCATING:
 			event_name = "Hash/Batch/Allocating";
 			break;
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index 4e3e4a4893..9aed52df4a 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -21,6 +21,7 @@
 #include "access/twophase.h"
 #include "commands/async.h"
 #include "miscadmin.h"
+#include "pgstat.h"
 #include "replication/walsender.h"
 #include "storage/latch.h"
 #include "storage/ipc.h"
@@ -368,6 +369,7 @@ EmitGlobalBarrier(GlobalBarrierKind kind)
 void
 WaitForGlobalBarrier(uint64 generation)
 {
+	pgstat_report_wait_start(WAIT_EVENT_GLOBAL_BARRIER);
 	for (int i = 0; i < (MaxBackends + max_prepared_xacts); i++)
 	{
 		PGPROC *proc = &ProcGlobal->allProcs[i];
@@ -389,6 +391,7 @@ WaitForGlobalBarrier(uint64 generation)
 			oldval = pg_atomic_read_u64(&proc->barrierGen);
 		}
 	}
+	pgstat_report_wait_end();
 }
 
 /*
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 2b656a8168..0d35b19420 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -827,6 +827,7 @@ typedef enum
 	WAIT_EVENT_CHECKPOINT_DONE,
 	WAIT_EVENT_CHECKPOINT_START,
 	WAIT_EVENT_EXECUTE_GATHER,
+	WAIT_EVENT_GLOBAL_BARRIER,
 	WAIT_EVENT_HASH_BATCH_ALLOCATING,
 	WAIT_EVENT_HASH_BATCH_ELECTING,
 	WAIT_EVENT_HASH_BATCH_LOADING,


Re: Contribution to Perldoc for TestLib module in Postgres

2019-07-10 Thread Alvaro Herrera
On 2019-Apr-11, Iwata, Aya wrote:

> In the above document, why not write a description after the function name?
> I think it is better to write the function name first and then the 
> description.
> In your code;
>   #Checks if all the tests passed or not
>  all_tests_passing()
> 
> In my suggestion;
>   all_tests_passing()
>   Checks if all the tests passed or not

Yeah, so there are two parts in the submitted patch: first the synopsis
lists the methods using the format you describe, and later the METHODS
section lists them again, using your suggested style.  I think we should
do away with the long synopsis -- maybe keep it as just the "use
TestLib" line, and then let the METHODS section list and describe the
methods.

> And some functions return value. How about adding return information
> to the above doc?

That's already in the METHODS section for some of them.  For example:

all_tests_passing()
Returns 1 if all the tests pass. Otherwise returns 0

It's missing for others, such as "tempdir".

In slurp_file you have this:
  Opens the file provided as an argument to the function in read mode(as
  indicated by <).
I think the parenthetical comment is useless; remove that.

Please break long source lines (say to 78 chars -- make sure pgperltidy
agrees), and keep some spaces after sentence-ending periods and other
punctuation.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: PGOPTIONS="-fh" make check gets stuck since Postgres 11

2019-07-10 Thread Tom Lane
Michael Paquier  writes:
> On Wed, Jul 10, 2019 at 12:51:28PM +0530, Amit Kapila wrote:
>> It would be good if we can come up with something like that.  It will
>> be helpful for zheap, where in some cases we get different row
>> ordering due to in-place updates.  As of now, we try to add Order By
>> or do some extra magic to get consistent row ordering.

> That was an issue for me as well when working with Postgres-XC when
> the row ordering was not guaranteed depending on the number of nodes
> (speaking of which Greenplum has the same issues, no?).  Adding ORDER
> BY clauses to a set of tests may make sense, but then this may impact
> the plans generated for some of them..

Yeah, I do not want to get into a situation where we can't test
queries that lack ORDER BY.  Also, the fact that tableam X doesn't
reproduce heap's row ordering is not a good reason to relax the
strength of the tests for heap.  So I'm wondering about some
postprocessing that we could optionally apply.  Perhaps the tools
Melanie mentions could help.

regards, tom lane




Re: let's make the list of reportable GUCs configurable (was Re: Add %r substitution for psql prompts to show recovery status)

2019-07-10 Thread Dave Cramer
On Wed, 10 Jul 2019 at 09:11, Robert Haas  wrote:

> On Tue, Jul 9, 2019 at 2:32 PM Dave Cramer  wrote:
> > So did this die from lack of interest?
> >
> > I have proposed in another thread adding more GUC REPORT variables, but
> I see this as a much better way.
> >
> > I'm willing to code the patch if we can get some buy in here ?
>
> It seemed like most people at least didn't hate the idea completely,
> and some liked it, so I think it would be worth revisiting.  If you
> decide to write a patch, I'll try to help review.
>

Awesome! I've already started working on the patch.

I'm still a bit conflicted about what to do with search_path as I do
believe this is potentially a security issue.
It may be that we always want to report that and possibly back patch it.

Dave


Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-10 Thread Tomas Vondra

On Tue, Jul 09, 2019 at 10:06:33PM -0400, Stephen Frost wrote:

Greetings,

* Ryan Lambert (r...@rustprooflabs.com) wrote:

> What I think Tomas is getting at here is that we don't write a page only
> once.



Yes, that's what I meant.


> A nonce of tableoid+pagenum will only be unique the first time we write
> out that page.  Seems unlikely that we're only going to be writing these
> pages once though- what we need is a nonce that's unique for *every
> write* of the 8k page, isn't it?  As every write of the page is going to
>  be encrypting something new.

> With sufficient randomness, we can at least be more likely to have a
> unique nonce for each 8K write.  Including the LSN seems like it'd be a
> possible alternative.

Agreed.  I know little of the inner details about the LSN but what I read
in [1] sounds encouraging in addition to tableoid + pagenum.

[1] https://www.postgresql.org/docs/current/datatype-pg-lsn.html


Yes, but it's still something that we'd have to store somewhere- the
actual LSN of the page is going to be in the 8K block.

Unless we decide that we can pull the LSN *out* of the 8K block and
store it unencrypted, and then store the *rest* of the block
encrypted...  That might also allow things like backup software to work
on these encrypted data files for page-level backups without needing
access to the key and that'd be pretty neat.

Of course, as with anything, the more data you expose, the higher the
overall risk that someone can figure out some meaning from it.  Still,
if the idea was that we'd use the LSN in this way, then it'd need to be
stored unencrypted regardless...



Elsewhere in this thread I've already proposed to leave a bit of space at
the end of a page unencrypted, with page-level encryption metadata. That
might be the nonce (no matter how we end up computing it), key ID used to
encrypt this page, etc.

I don't think we need to put the whole LSN into the nonce in plaintext.
What I was imagining was intead using something like

   sha2(LSN, oid, blockno, random())

or something like that.
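
A minimal sketch of that derivation (sha256_bytes() here is a hypothetical
helper standing in for any SHA-256 implementation, and truncating the digest
to the cipher block size is an assumption, not a settled design):

#include <stdint.h>
#include <string.h>

/* hypothetical helper -- any SHA-256 implementation would do */
extern void sha256_bytes(const void *data, size_t len, uint8_t digest[32]);

typedef struct PageNonceInput
{
	uint64_t	lsn;			/* page LSN at the time of the write */
	uint32_t	reloid;			/* relation OID */
	uint32_t	blkno;			/* block number within the relation */
	uint64_t	salt;			/* fresh random value per write */
} PageNonceInput;

static void
compute_page_nonce(const PageNonceInput *in, uint8_t nonce[16])
{
	uint8_t		digest[32];

	sha256_bytes(in, sizeof(*in), digest);
	memcpy(nonce, digest, 16);	/* truncate to the 16-byte AES block size */
}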

Of course, having the LSN (and other stuff like page checksum) unencrypted
would be pretty useful - as you note.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Add parallelism and glibc dependent only options to reindexdb

2019-07-10 Thread Alvaro Herrera
On 2019-Jul-09, Julien Rouhaud wrote:

> I finished to do a better refactoring, and ended up with this API in
> scripts_parallel:

Looking good!  I'm not sure about the "Consume" word in ConsumeIdleSlot;
maybe "Reserve"? "Obtain"? "Get"?

Code commentary: I think the comment that sits atop the function should
describe what the function does without getting too much into how it does
it.  For example in ConsumeIdleSlot you have "If there are multiple
slots, here we wait for one connection to become available if none
already is, returning NULL if an error occurred.  Otherwise, we simply
use the only slot we have, which we know to be free." which seems like
it should be in another comment *inside* the function; make the external
one something like "Reserve and return a connection that is currently
idle, waiting until one becomes idle if none is".  Maybe you can put the
part I first quoted as a second paragraph in the comment at top of
function and keeping the second part I quoted as first paragraph; we
seem to use that style too.

Placement: I think it's good if related functions stay together, or
there is some other rationale for placement within the file.  I have two
favorite approaches: one is to put all externally callable functions at
top of file, followed by all the static helpers in the lower half of the
file.  The other is to put each externally accessible immediately
followed by its specific static helpers.  If you choose one of those,
that means that SetupParallelSlots should either move upwards, or move
downwards.  The current ordering seems a dartboard kind of thing where
the thrower is not Green Arrow.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [proposal] de-TOAST'ing using a iterator

2019-07-10 Thread Binguo Bao
Hi Thomas,
I've fixed the warnings.

Thomas Munro  于2019年7月5日周五 下午12:21写道:

> On Thu, Jun 20, 2019 at 1:51 AM Binguo Bao  wrote:
> > Hi hackers!
> > This proposal aims to provide the ability to de-TOAST a fully TOAST'd
> and compressed field using an iterator and then update the appropriate
> parts of the code to use the iterator where possible instead of
> de-TOAST'ing and de-compressing the entire value. Examples where this can
> be helpful include using position() from the beginning of the value, or
> doing a pattern or substring match.
> >
> > de-TOAST iterator overview:
> > 1. The caller requests the slice of the attribute value from the
> de-TOAST iterator.
> > 2. The de-TOAST iterator checks if there is a slice available in the
> output buffer, if there is, return the result directly,
> > otherwise go to step 3.
> > 3. The de-TOAST iterator checks if there is a slice available in the
> input buffer, if there is, go to step 4. Otherwise,
> > call fetch_datum_iterator to fetch datums from disk to input buffer.
> > 4. If the data in the input buffer is compressed, extract some data from
> the input buffer to the output buffer until the caller's
> > needs are met.
> >
> > I've implemented the prototype and apply it to the position() function
> to test performance.
>
> Hi Binguo,
>
> Interesting work, and nice performance improvements so far.  Just by
> the way, the patch currently generates warnings:
>
> https://travis-ci.org/postgresql-cfbot/postgresql/builds/554345719
>
> --
> Thomas Munro
> https://enterprisedb.com
>
From 0ac012867be04b409bfc4e67306fd9f3e5b4785a Mon Sep 17 00:00:00 2001
From: BBG 
Date: Tue, 4 Jun 2019 22:56:42 +0800
Subject: [PATCH] de-TOASTing using a iterator

---
 src/backend/access/heap/tuptoaster.c | 489 +++
 src/backend/utils/adt/varlena.c  |  48 ++--
 src/include/access/tuptoaster.h  |  92 +++
 3 files changed, 613 insertions(+), 16 deletions(-)

diff --git a/src/backend/access/heap/tuptoaster.c b/src/backend/access/heap/tuptoaster.c
index 55d6e91..07bb69e 100644
--- a/src/backend/access/heap/tuptoaster.c
+++ b/src/backend/access/heap/tuptoaster.c
@@ -83,6 +83,13 @@ static int	toast_open_indexes(Relation toastrel,
 static void toast_close_indexes(Relation *toastidxs, int num_indexes,
 LOCKMODE lock);
 static void init_toast_snapshot(Snapshot toast_snapshot);
+static FetchDatumIterator create_fetch_datum_iterator(struct varlena *attr);
+static bool free_fetch_datum_iterator(FetchDatumIterator iter);
+static int32 fetch_datum_iterate(FetchDatumIterator iter);
+static void init_toast_buffer(ToastBuffer *buf, int size, bool compressed);
+static bool free_toast_buffer(ToastBuffer *buf);
+static int32 pglz_decompress_iterate(ToastBuffer *source, ToastBuffer *dest,
+	 DetoastIterator iter, int32 length);
 
 
 /* --
@@ -347,6 +354,145 @@ heap_tuple_untoast_attr_slice(struct varlena *attr,
 
 
 /* --
+ * create_detoast_iterator -
+ *
+ * Initialize detoast iterator.
+ * --
+ */
+DetoastIterator create_detoast_iterator(struct varlena *attr) {
+	struct varatt_external toast_pointer;
+	DetoastIterator iterator = NULL;
+	if (VARATT_IS_EXTERNAL_ONDISK(attr))
+	{
+		/*
+		 * This is an externally stored datum --- create fetch datum iterator
+		 */
+		iterator = (DetoastIterator) palloc0(sizeof(DetoastIteratorData));
+		iterator->fetch_datum_iterator = create_fetch_datum_iterator(attr);
+		VARATT_EXTERNAL_GET_POINTER(toast_pointer, attr);
+		if (VARATT_EXTERNAL_IS_COMPRESSED(toast_pointer))
+		{
+			/* If it's compressed, prepare buffer for raw data */
+			iterator->buf = (ToastBuffer *) palloc0(sizeof(ToastBuffer));
+			init_toast_buffer(iterator->buf, toast_pointer.va_rawsize, false);
+			iterator->source = NULL;
+			iterator->ctrlc = 0;
+			iterator->compressed = true;
+			iterator->done = false;
+		}
+		else
+		{
+			iterator->buf = iterator->fetch_datum_iterator->buf;
+			iterator->source = NULL;
+			iterator->ctrlc = 0;
+			iterator->compressed = false;
+			iterator->done = false;
+		}
+	}
+	else if (VARATT_IS_EXTERNAL_INDIRECT(attr))
+	{
+		/*
+		 * This is an indirect pointer --- dereference it
+		 */
+		struct varatt_indirect redirect;
+
+		VARATT_EXTERNAL_GET_POINTER(redirect, attr);
+		attr = (struct varlena *) redirect.pointer;
+
+		/* nested indirect Datums aren't allowed */
+		Assert(!VARATT_IS_EXTERNAL_INDIRECT(attr));
+
+		/* recurse in case value is still extended in some other way */
+		iterator = create_detoast_iterator(attr);
+
+	}
+	else if (VARATT_IS_COMPRESSED(attr))
+	{
+		/*
+		 * This is a compressed value inside of the main tuple
+		 */
+		iterator = (DetoastIterator) palloc0(sizeof(DetoastIteratorData));
+		iterator->fetch_datum_iterator = NULL;
+		iterator->source = palloc0(sizeof(ToastBuffer));
+		iterator->source->buf = (const char*) attr;
+		iterator->source->position = TOAST_COMPRESS_RAWDATA(attr);
+		iterator->source->limit = (char *)attr + VARSIZE(attr);
+		it

Re: pg_receivewal documentation

2019-07-10 Thread Alvaro Herrera
On 2019-Jul-09, Jesper Pedersen wrote:

> +   
> +Note that while WAL will be flushed with this setting,
> +it will never be applied, so  linkend="guc-synchronous-commit"/> must
> +not be set to remote_apply if 
> pg_receivewal
> +is the only synchronous standby.
> +   

+1 to document this caveat.

How about 
Note that while WAL will be flushed with this setting,
pg_receivewal never applies it, so
 must not be set to
remote_apply if 
pg_receivewal
is the only synchronous standby.
?

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-10 Thread Tomas Vondra

On Wed, Jul 10, 2019 at 03:38:54PM +0900, Masahiko Sawada wrote:

On Tue, Jul 9, 2019 at 9:01 PM Joe Conway  wrote:


On 7/9/19 6:07 AM, Peter Eisentraut wrote:
> On 2019-07-08 18:09, Joe Conway wrote:
>> In my mind, and in practice to a
>> large extent, a postgres tablespace == a unique mount point.
>
> But a critical difference is that in file systems, a separate mount
> point has its own journal.

While it would be ideal to have separate WAL, and even separate shared
buffer pools, per tablespace, I think that is too much complexity for
the first implementation and we could have a single separate key for all
WAL for now.


If we encrypt different tables with different keys I think we need to
encrypt WAL with the same keys as we used for tables, as per
discussion so far. And we would need to encrypt each WAL record, not
whole WAL 8k pages.



I don't think we actually need that - we need to ensure that users don't
have access to the key used to encrypt WAL.

This is very much a question of the threat model. If we see TDE as a
data-at-rest solution, then I think it's fine to have a separate keyring
with such keys, and only allow access to system processes.

If our threat model includes cases where people can read memory (does not
matter if it's because of a vulnerability, privilege escalation or just
allowing the people to load an extension with custom C function), then I
think we've already lost. Those people will be able to read the keys
anyway, no matter how many keys are used to encrypt the WAL.

We may need to change how WAL writing works, so that individual backends
don't really write into the WAL buffers directly, and instead hand over the
data to a separate process (with access to the key). We already have the
walwriter, but IIRC it may not be running and we consider that to be OK.
Or maybe we could have "encrypter" process that does just that.

That's surely non-trivial work, but it seems like much less work compared
to reworking the WAL format to allow WAL to be encrypted with different
keys. At least for v0 that should be OK.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: Index Skip Scan

2019-07-10 Thread Floris Van Nee
> Here is finally a new version of the patch, where all the mentioned issues

> seems to be fixed and the corresponding new tests should keep it like that
> (I've skipped all the pubs at PostgresLondon for that).


Thanks for the new patch! Really appreciate the work you're putting into it.


I verified that the backwards index scan is indeed functioning now. However, 
I'm afraid it's not that simple, as I think the cursor case is broken now. I 
think having just the 'scan direction' in the btree code is not enough to get 
this working properly, because we need to know whether we want the minimum or 
maximum element of a certain prefix. There are basically four cases:

- Forward Index Scan + Forward cursor: we want the minimum element within a 
prefix and we want to skip 'forward' to the next prefix

- Forward Index Scan + Backward cursor: we want the minimum element within a 
prefix and we want to skip 'backward' to the previous prefix

- Backward Index Scan + Forward cursor: we want the maximum element within a 
prefix and we want to skip 'backward' to the previous prefix

- Backward Index Scan + Backward cursor: we want the maximum element within a 
prefix and we want to skip 'forward' to the next prefix

These cases make it rather complicated unfortunately. They do somewhat tie in 
with the previous discussion on this thread about being able to skip to the min 
or max value. If we ever want to support a sort of minmax scan, we'll encounter 
the same issues.


Also, I think in planner.c, line 4831, we should actually be looking at whether 
uniq_distinct_pathkeys is NIL, rather than the current check for 
distinct_pathkeys. That'll make the planner pick the skip scan even with 
queries like "select distinct on (a) a,b where a=2". Currently, it doesn't pick 
the skip scan here, because distinct_pathkeys does not contain "a" anymore. 
This leads to it scanning every item for a=2 even though it can stop after the 
first one.


I'll do some more tests with the patch.


-Floris



Re: Optimize partial TOAST decompression

2019-07-10 Thread Tomas Vondra

On Wed, Jul 10, 2019 at 01:35:25PM +0800, Binguo Bao wrote:

Tomas Vondra  于2019年7月10日周三 上午5:12写道:


On Sat, Jul 06, 2019 at 05:23:37PM +0200, Tomas Vondra wrote:
>On Sat, Jul 06, 2019 at 02:27:56AM +0800, Binguo Bao wrote:
>>Hi, Tomas!
>>Thanks for your testing and the suggestion.
>>
>>That's quite bizarre behavior - it does work with a prefix, but not with
>>>suffix. And the exact ERROR changes after the prefix query.
>>
>>
>>I think bug is caused by "#2  0x004c3b08 in
>>heap_tuple_untoast_attr_slice (attr=, sliceoffset=0,
>>slicelength=-1) at tuptoaster.c:315",
>>since I ignore the case where slicelength is negative, and I've appended
>>some comments for heap_tuple_untoast_attr_slice for the case.
>>
>>FWIW the new code added to this
>>>function does not adhere to our code style, and would deserve some
>>>additional explanation of what it's doing/why. Same for the
>>>heap_tuple_untoast_attr_slice, BTW.
>>
>>
>>I've added more comments to explain the code's behavior.
>>Besides, I also modified the macro "TOAST_COMPRESS_RAWDATA" to
>>"TOAST_COMPRESS_DATA" since
>>it is used to get toast compressed data rather than raw data.
>>
>
>Thanks, this seems to address the issue - I can no longer reproduce the
>crashes, allowing the benchmark to complete. I'm attaching the script I
>used and spreadsheet with a summary of results.
>
>For the cases I've tested (100k - 10M values, different compressibility,
>different prefix/length values), the results are kinda mixed - many
>cases got much faster (~2x), but other cases got slower too. We're
>however talking about queries taking a couple of milliseconds, so in
>absolute numbers the differences are small.
>
>That does not mean the optimization is useless - but the example shared
>at the beginning of this thread is quite extreme, as the values are
>extremely compressible. Each value is ~38MB (in plaintext), but a table
>with 100 such values takes only ~40MB. That's a 100:1 compression ratio,
>which I think is not typical for real-world data sets.
>
>The data I've used are less extreme, depending on the fraction of random
>data in values.
>
>I went through the code too. I've reworded a couple of comments and code
>style issues, but there are a couple of more serious issues.
>
>
>1) Why rename TOAST_COMPRESS_RAWDATA to TOAST_COMPRESS_DATA?
>
>This seems unnecessary, and it discards the clear hint that it's about
>accessing the *raw* data, and the relation to TOAST_COMPRESS_RAWSIZE.
>IMHO we should keep the original naming.
>
>
>2) pglz_maximum_compressed_size signatures are confusing
>
>There are two places with pglz_maximum_compressed_size signature, and
>those places are kinda out of sync when it comes to parameter names:
>
> int32
> pglz_maximum_compressed_size(int32 raw_slice_size,
>  int32 total_compressed_size)
>
> extern
> int32 pglz_maximum_compressed_size(int32 raw_slice_size,
>int32 raw_size);
>
>Also, pg_lzcompress.c has no concept of a "slice" because it only deals
>with simple compression; slicing is the responsibility of the tuptoaster. So
>we should not mix those two, not even in comments.
>
>
>I propose tweaks per the attached patch - I think it makes the code
>clearer, and it's mostly cosmetic stuff. But I haven't tested the
>changes beyond "it compiles".
>
>
>regards
>

FWIW I've done another round of tests with larger values (up to ~10MB)
and the larger the values the better the speedup (a bit as expected).
Attached is the script I used and a spreadsheet with a summary.

We still need a patch addressing the review comments, though.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services






Hi, Tomas!  Sorry for the late reply.  Thank you for the further testing.
I am trying to reproduce your first test summary, since I believe the
patch should not regress performance in almost any case.



I've made some changes to the test script since the first benchmark,
aiming to make the results more stable

1) uses larger amount of data (10x more)

2) the data set is recreated for each run (to rule out random differences in
the random data affecting the runs differently)

3) minor configuration changes (more shared buffers etc.)

I don't think we need to worry about small differences (within ~5%) which
can easily be due to changes to binary layout. And otherwise results from
the second benchmark round seem much more stable.


Besides, if a value is composed of random characters, it will be hard to
compress, because pglz by default requires at least a 25% space saving,
otherwise it stores the value uncompressed (e.g. a 1000-byte datum must
compress to 750 bytes or less). This means that querying such a value
will not trigger the patch. But the first test results show that the
patch is slower than master when the value is composed of random
characters, which is confusing.



Yes, I know. But the values have a compressible and an incompressible
(random) part, so in most cases the value should be compressible, altho

Re: Index Skip Scan

2019-07-10 Thread Dmitry Dolgov
> On Tue, Jul 2, 2019 at 2:27 PM David Rowley  
> wrote:
>
> The more I think about these UniqueKeys, the more I think they need to
> be a separate concept to PathKeys. For example, UniqueKeys: { x, y }
> should be equivalent to { y, x }, but with PathKeys, that's not the
> case, since the order of each key matters. UniqueKeys equivalent
> version of pathkeys_contained_in() would not care about the order of
> individual keys, it would say things like, { a, b, c } is contained in
> { b, a }, since if the path is unique on columns { b, a } then it must
> also be unique on { a, b, c }.

> On Tue, Jul 9, 2019 at 3:32 PM Jesper Pedersen  
> wrote:
>
> David, are you thinking about something like the attached ?
>
> Some questions.
>
> * Do you see UniqueKey as a "complete" planner node ?
>- I didn't update the nodes/*.c files for this yet
>
> * Is a UniqueKey with a list of EquivalenceClass best, or a list of
> UniqueKey with a single EquivalenceClass

Just for me to clarify, the idea is to replace PathKeys with a new concept of
"UniqueKey" for skip scans, right? If I see it correctly, of course

UniqueKeys { x, y } == UniqueKeys { y, x }

from the result point of view, but the execution costs could be different due
to different values distribution. In fact there are efforts to utilize this to
produce more optimal order [1], but with UniqueKeys concept this information is
lost. Obviously it's not something that would be an immediate (or perhaps any)
problem for skip scan functionality, but I guess it's still worth pointing it
out.

> On Wed, Jul 10, 2019 at 4:40 PM Floris Van Nee  
> wrote:
>
> I verified that the backwards index scan is indeed functioning now. However,
> I'm afraid it's not that simple, as I think the cursor case is broken now.

Thanks for testing! Could you provide a test case to show what exactly is the
problem?

[1]: 
https://www.postgresql.org/message-id/flat/7c79e6a5-8597-74e8-0671-1c39d124c9d6%40sigaev.ru




Re: Index Skip Scan

2019-07-10 Thread Floris Van Nee

> Thanks for testing! Could you provide a test case to show what exactly is the
> problem?

create table a (a int, b int, c int);
insert into a (select vs, ks, 10 from generate_series(1,5) vs, 
generate_series(1, 1) ks);
create index on a (a,b);
analyze a;

set enable_indexskipscan=1; -- setting this to 0 yields different results 
set random_page_cost=1;
explain SELECT DISTINCT ON (a) a,b FROM a;

BEGIN;
DECLARE c SCROLL CURSOR FOR SELECT DISTINCT ON (a) a,b FROM a;

FETCH FROM c;
FETCH BACKWARD FROM c;

FETCH 6 FROM c;
FETCH BACKWARD 6 FROM c;

FETCH 6 FROM c;
FETCH BACKWARD 6 FROM c;

END;




Re: range_agg

2019-07-10 Thread Paul Jungwirth

On 7/9/19 11:24 PM, David Fetter wrote:

I seem to recall that the usual convention (at least in math) is
to use intervals that are generally represented as open on the
infinity side, but that might not fit how we do things.


I think it does, unless I'm misunderstanding?


Oh, I was just wondering about the square bracket on the left side of
[null, 1).  It's not super important.


Ah, I understand now. Just a typo on my part. Thanks for catching it, 
and sorry for the confusion!



!mr , perhaps?


I like that suggestion. Honestly I'm not sure we even want an inverse,
but it's so important theoretically we should at least consider
whether it is appropriate here. Or maybe "inverse" is the wrong word
for this, or there is a different meaning it should have.


Jeff's suggestion of ~ for complement is better.


Okay, thanks. I like it better too.

Yours,

--
Paul  ~{:-)
p...@illuminatedcomputing.com




Re: Index Skip Scan

2019-07-10 Thread Dmitry Dolgov
> On Wed, Jul 10, 2019 at 4:52 PM Floris Van Nee  
> wrote:
>
> > Thanks for testing! Could you provide a test case to show what exactly is 
> > the
> > problem?
>
> create table a (a int, b int, c int);
> insert into a (select vs, ks, 10 from generate_series(1,5) vs, 
> generate_series(1, 1) ks);
> create index on a (a,b);
> analyze a;
>
> set enable_indexskipscan=1; -- setting this to 0 yields different results
> set random_page_cost=1;
> explain SELECT DISTINCT ON (a) a,b FROM a;
>
> BEGIN;
> DECLARE c SCROLL CURSOR FOR SELECT DISTINCT ON (a) a,b FROM a;
>
> FETCH FROM c;
> FETCH BACKWARD FROM c;
>
> FETCH 6 FROM c;
> FETCH BACKWARD 6 FROM c;
>
> FETCH 6 FROM c;
> FETCH BACKWARD 6 FROM c;
>
> END;

Ok, give me a moment, I'll check.




Re: Index Skip Scan

2019-07-10 Thread Floris Van Nee

> Thanks for testing! Could you provide a test case to show what exactly is the
> problem?

Note that in the case of a regular non-skip scan, this cursor backwards works 
because the Unique node on top does not support backwards scanning at all. 
Therefore, when creating the cursor, the actual plan actually contains a 
Materialize node on top of the Unique+Index Scan nodes. The 'fetch backwards' 
never reaches the index scan therefore, as it just fetches stuff from the 
materialize node.

-Floris



Re: Index Skip Scan

2019-07-10 Thread Dmitry Dolgov
> On Wed, Jul 10, 2019 at 5:00 PM Floris Van Nee  
> wrote:
>
>
> > Thanks for testing! Could you provide a test case to show what exactly is 
> > the
> > problem?
>
> Note that in the case of a regular non-skip scan, this cursor backwards works
> because the Unique node on top does not support backwards scanning at all.
> Therefore, when creating the cursor, the actual plan actually contains a
> Materialize node on top of the Unique+Index Scan nodes. The 'fetch backwards'
> never reaches the index scan therefore, as it just fetches stuff from the
> materialize node.

Yeah, I'm aware. The last time I was busy with cursors I managed to
make it work as I wanted, so at that time I decided to keep it like that, even
though without skip scan it wasn't doing backward scans.




Re: pg_receivewal documentation

2019-07-10 Thread Jesper Pedersen

Hi,

On 7/10/19 10:24 AM, Alvaro Herrera wrote:

+1 to document this caveat.

How about
 Note that while WAL will be flushed with this setting,
 pg_receivewal never applies it, so
  must not be set to
 remote_apply if 
pg_receivewal
 is the only synchronous standby.
?



Sure.

Best regards,
 Jesper
>From cc51d52b2f67b33cd0faba1a74e05e93fc859011 Mon Sep 17 00:00:00 2001
From: jesperpedersen 
Date: Tue, 9 Jul 2019 13:14:25 -0400
Subject: [PATCH] Highlight that pg_receivewal doesn't apply WAL, and as such
 synchronous-commit needs to be remote_write or lower.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Authors: Álvaro Herrera, Laurenz Albe and Jesper Pedersen
Review-by: Álvaro Herrera, Laurenz Albe and Jesper Pedersen
---
 doc/src/sgml/ref/pg_receivewal.sgml | 8 
 1 file changed, 8 insertions(+)

diff --git a/doc/src/sgml/ref/pg_receivewal.sgml b/doc/src/sgml/ref/pg_receivewal.sgml
index 0506120c00..16f9c41ff1 100644
--- a/doc/src/sgml/ref/pg_receivewal.sgml
+++ b/doc/src/sgml/ref/pg_receivewal.sgml
@@ -207,6 +207,14 @@ PostgreSQL documentation
 server as a synchronous standby, to ensure that timely feedback is
 sent to the server.

+
+   
+Note that while WAL will be flushed with this setting,
+pg_receivewal never applies it, so
+ must not be set to
+remote_apply if pg_receivewal
+is the only synchronous standby.
+   
   
  
 
-- 
2.21.0



Re: FETCH FIRST clause PERCENT option

2019-07-10 Thread Alvaro Herrera
On 2019-Jul-08, Surafel Temesgen wrote:

> Hi Thomas,
> Thank you for informing me
> 
> Hi Surafel,
> >
> > There's a call to adjust_limit_rows_costs() hiding under
> > contrib/postgres_fdw, so this fails check-world.
> >
> 
> Fixed . I also include the review given by Ryan in attached patch

What's with the new tuplestore function for getting heap tuples?  That
looks really odd.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: PGOPTIONS="-fh" make check gets stuck since Postgres 11

2019-07-10 Thread Ashwin Agrawal
On Wed, Jul 10, 2019 at 6:46 AM Tom Lane  wrote:

> Michael Paquier  writes:
> > On Wed, Jul 10, 2019 at 12:51:28PM +0530, Amit Kapila wrote:
> >> It would be good if we can come up with something like that.  It will
> >> be helpful for zheap, where in some cases we get different row
> >> ordering due to in-place updates.  As of now, we try to add Order By
> >> or do some extra magic to get consistent row ordering.
>
> > That was an issue for me as well when working with Postgres-XC when
> > the row ordering was not guaranteed depending on the number of nodes
> > (speaking of which Greenplum has the same issues, no?).  Adding ORDER
> > BY clauses to a set of tests may make sense, but then this may impact
> > the plans generated for some of them..
>
> Yeah, I do not want to get into a situation where we can't test
> queries that lack ORDER BY.  Also, the fact that tableam X doesn't
> reproduce heap's row ordering is not a good reason to relax the
> strength of the tests for heap.  So I'm wondering about some
> postprocessing that we could optionally apply.  Perhaps the tools
> Melanie mentions could help.
>

As it happens, I have been working for a couple of days on using those
Perl tools from Greenplum for Zedstore. For Zedstore, plans differ in
many regress tests because the relation size is not the same as heap's,
and for similar reasons row orderings change as well. The goal is to use
the tests untouched to validate Zedstore, and yes, I was thinking it
will help Zheap testing as well. I also tested the same for
regressplans.sh, and it will lift a lot of the manual burden of
investigating the results, since one can specify to completely ignore
explain plan outputs in the comparison between results and expected
output. Will post a patch for the tool once I get it into decent shape.


Re: pg_checksums (or checksums in general) vs tableam

2019-07-10 Thread Magnus Hagander
On Wed, Jul 10, 2019 at 3:05 PM Michael Paquier  wrote:

> On Wed, Jul 10, 2019 at 11:42:34AM +0200, Magnus Hagander wrote:
> > pg_checksums enumerate the files. What if there are files there from a
> > different tableam? Isn't pg_checksums just going to badly fail then,
> since
> > it assumes everything is heap?
> >
> > Also, do we allow AMs that don't support checksumming data? Do we have
> any
> > checks for tables created with such AMs in a system that has checksums
> > enabled?
>
> Table AMs going through shared buffers and smgr.c, like zedstore,
> share the same page header, meaning that the on-disk file is the same
> as heap, and that checksums are computed the same way as for heap.
> pg_checksums is not going to complain on those ones and would work
> just fine.


> Table AMs using their own storage layer (which would most likely use
> their own checksum method normally?) would be ignored by pg_checksums
> if the file names don't match what smgr uses, but it could result in
> failures if they use on-disk file names which match.
>

That would be fine, if we actually knew. Should we (or have we already?)
defined a rule that they are not allowed to use the same naming standard
unless they have the same type of header?

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: pg_checksums (or checksums in general) vs tableam

2019-07-10 Thread Andres Freund
Hi,

On July 10, 2019 9:12:18 AM PDT, Magnus Hagander  wrote:
>On Wed, Jul 10, 2019 at 3:05 PM Michael Paquier 
>wrote:
>
>> On Wed, Jul 10, 2019 at 11:42:34AM +0200, Magnus Hagander wrote:
>> > pg_checksums enumerate the files. What if there are files there
>from a
>> > different tableam? Isn't pg_checksums just going to badly fail
>then,
>> since
>> > it assumes everything is heap?
>> >
>> > Also, do we allow AMs that don't support checksumming data? Do we
>have
>> any
>> > checks for tables created with such AMs in a system that has
>checksums
>> > enabled?
>>
>> Table AMs going through shared buffers and smgr.c, like zedstore,
>> share the same page header, meaning that the on-disk file is the same
>> as heap, and that checksums are compiled similarly to heap.
>> pg_checksums is not going to complain on those ones and would work
>> just fine.
>
>
>> Table AMs using their own storage layer (which would most likely use
>> their own checksum method normally?) would be ignored by pg_checksums
>> if the file names don't match what smgr uses, but it could result in
>> failures if they use on-disk file names which match.
>>
>
>That would be fine, if we actually knew. Should we (or have we
>already?)
>defined a rule that they are not allowed to use the same naming
>standard
>unless they have the same type of header?

No, don't think we have already. There's the related problem of what to include 
in base backups, too.

Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.




Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-10 Thread Bruce Momjian
On Wed, Jul 10, 2019 at 08:31:17AM -0400, Joe Conway wrote:
> Please see my other reply (and
> https://nvlpubs.nist.gov/nistpubs/Legacy/SP/nistspecialpublication800-38a.pdf
> appendix C as pointed out by Ryan downthread).
> 
> At least in my mind, I trust a published specification from the
> nation-state level over random blogs or wikipedia. If we can find some
> equivalent published standards that contradict NIST we should discuss
> it, but for my money I would prefer to stick with the NIST recommended
> method to produce the IVs.

We have had a flurry of activity on this thread in the past day, so
let me summarize:

*  Using the database oid does make the nonce unique in the cluster as
long as we re-encrypt when we do CREATE DATABASE.  We can avoid some of
that by not encrypting template1, but we have the WITH TEMPLATE option
to use other databases as templates, so we might as well always just
decrypt/re-encrypt.

*  However, the page will be rewritten many times, so if we use just
pg_database/pg_class/page-offset for the nonce, we are re-encrypting
with the same nonce/IV for multiple page values, which is a security
issue.

*  Using the LSN as part of the nonce fixes both problems, and has a
   third benefit:

*  We don't need to decrypt/re-encrypt during CREATE DATABASE since
   the page contents are the same in both places, and once one
   database changes its pages, it gets a new LSN, and hence a new
   nonce/IV.

*  For each change of an 8k page, you get a new nonce/IV, so you
   are not encrypting different data with the same nonce/IV

*  This avoids requiring pg_upgrade to preserve database oids.

*  It was determined that combining known values like pg_class.oid, LSN, and
page number into a nonce and running that through the encryption function
to create an IV is sufficient.

However, the LSN must then be visible on the encrypted pages.  I would
like to avoid having different page formats for encrypted and
non-encrypted pages, because if we require additional storage for
encrypted pages (like adding a random number), existing non-encrypted
pages might not be able to fit in the encrypted format, causing
complexity when accessing them and when converting tables to encrypted
format.

Looking at the page header, I see:

typedef struct PageHeaderData
{
	/* XXX LSN is member of *any* block, not only page-organized ones */
	PageXLogRecPtr pd_lsn;		/* LSN: next byte after last byte of xlog
								 * record for last change to this page */
	uint16		pd_checksum;	/* checksum */
	uint16		pd_flags;		/* flag bits, see below */
	LocationIndex pd_lower;		/* offset to start of free space */
	LocationIndex pd_upper;		/* offset to end of free space */

pd_lsn/PageXLogRecPtr is 8 bytes.  (We might only want to use the low
order bits for the nonce.)  LocationIndex is 16 bits, meaning that the
five fields listed above are 16 bytes wide in total, which is the width of the
typical AES cipher mode block.  I suggest we _not_ encrypt the first 16
bytes of each 8k page, and start encrypting at byte 17 --- that way,
these values are visible and can be used as part of the nonce to create
an IV.
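
As a sketch of what that would look like (encrypt_bytes() is a hypothetical
cipher helper, not an existing PostgreSQL API):

#include <stddef.h>
#include <stdint.h>

#define PAGE_PLAINTEXT_BYTES 16		/* pd_lsn .. pd_upper stay in the clear */

/* hypothetical AES helper; the IV would be derived from the clear header */
extern void encrypt_bytes(uint8_t *buf, size_t len,
						  const uint8_t *key, const uint8_t *iv);

static void
encrypt_page(uint8_t *page, size_t pagesize,
			 const uint8_t *key, const uint8_t *iv)
{
	/* skip the first 16 header bytes so they stay readable for the nonce */
	encrypt_bytes(page + PAGE_PLAINTEXT_BYTES,
				  pagesize - PAGE_PLAINTEXT_BYTES, key, iv);
}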

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: POC: Cleaning up orphaned files using undo logs

2019-07-10 Thread Robert Haas
On Wed, Jul 10, 2019 at 2:32 AM Amit Kapila  wrote:
> As of now, after we finish executing the rollback actions, the entry
> from the hash table is removed.  Now, at a later time (when queues are
> full and we want to insert a new entry) when we access the queue entry
> (to check whether we can remove it)  corresponding to the removed hash
> table entry, will it be safe to access it?  The hash table entry might
> have been freed or would have been reused as some other entry by the
> time we try to access it.

Hmm, yeah, that's a problem.  I think we could possibly fix this by
having the binary heaps just store a FullTransactionId rather than a
pointer to the RollBackHashEntry.  Then, if you get a
FullTransactionId from the binary heap, you just do a hash table
lookup to find the RollBackHashEntry instead of accessing it directly.
  If it doesn't exist, then you can just discard the entry: it's for
some old transaction that's no longer relevant.
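
A sketch of that dequeue path, assuming a 64-bit build where the xid fits
in a Datum (names like rollbackHeap and RollbackHT are illustrative, not
the patch's):

#include "postgres.h"
#include "access/transam.h"
#include "lib/binaryheap.h"
#include "utils/hsearch.h"

/*
 * Sketch: on dequeue, ignore queue entries whose transaction no longer
 * has a hash table entry (it was already rolled back and removed).
 */
static void
discard_if_stale(binaryheap *rollbackHeap, HTAB *RollbackHT)
{
	Datum		top = binaryheap_first(rollbackHeap);
	FullTransactionId fxid = {.value = DatumGetUInt64(top)};
	bool		found;

	(void) hash_search(RollbackHT, &fxid, HASH_FIND, &found);
	if (!found)
		(void) binaryheap_remove_first(rollbackHeap);	/* stale, discard */
}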

However, there are a few problems with that idea. One is that I see
that you've made the hash table keyed by full_xid + start_urec_ptr
rather than just full_xid, so if the queues just point to an XID, it's
not enough to find the hash table entry.  The comment claims that this
is necessary because "in the same transaction, there could be rollback
requests for both logged and unlogged relations," but I don't
understand why that means we need start_urec_ptr in the hash table
key. It would seem more natural to me to have a single entry that
covers both the logged and the unlogged undo for that transaction.

(Incidentally, I don't think it's correct that RollbackHashEntry
starts with FullTransactionId full_xid + UndoRecPtr start_uprec_ptr
declared separately; I think it should start with RollbackHashKey -
although if we change the key back to just a FullTransactionId then we
don't need to worry separately about fixing this issue.)

Another problem is that on a 64-bit system, we can pass a
FullTransactionId by value, but on a 32-bit system we can't. That's
awkward, because if we can't pass the XID by value, then we're back to
needing a separately-allocated structure for the queue entries, which
I was really hoping to avoid.

A second possible approach to this problem is to just reset all the
binary heaps (using binaryheap_reset) whenever we insert a new entry
into the hash table, and rebuild them the next time they're needed by
reinserting all of the current entries in the hash table. That might
be too inefficient.  You can insert a bunch of things in a row without
re-heaping, and you can dequeue a bunch of things in a row without
re-heaping, but if they alternate you'll re-heap a lot. I don't know
whether that costs enough to worry about; it might be fine.

A third possible approach is to allocate a separate array whose
entries are reused, and to maintain a freelist of entries from that
array.  All the real data is stored in this array, and the binary
heaps and hash table entries just point to it.  When the freelist is
empty, the next allocate scans all the binary heaps and removes any
pointers to inactive entries; it then puts all inactive entries back
onto the freelist. This is more complex than the previous approach,
and it doesn't totally avoid re-heaping, because removing pointers to
inactive entries from the binary heaps will necessitate a re-heap on
next access. However, if the total capacity of the data structures is
large compared to the number of entries actually in use, which will
usually be true, we'll have to re-heap much less often, because we
only have to do it when the number of allocations exhausts
*everything* on the free-list, rather than after every allocation.

A fourth possible approach is to enhance the simplehash mechanism to
allow us to do cleanup when an item to which there might still be
residual pointers is reused. We could allow some code supplied by the
definer of an individual simplehash implementation to be executed
inside SH_INSERT, just at the point where we're going to make an entry
status SH_STATUS_IN_USE.  What we'd do is add a flag to the structure
indicating whether there might be deferred cleanup work for that
entry.  Maybe it would be called something like 'bool processed' and
set when we process the undo work for that entry.  If, when we're
about to reuse an entry, that flag is set, then we go scan all the
binary heaps and remove all entries for which that flag is set.  And
then we unset the flag for all of those entries. Like the previous
approach, this is basically a refinement of the second approach in
that it tries to avoid re-heaping too often. Here, instead of
re-heaping once we've been through the entire free-list, we'll re-heap
when we happen (more or less randomly) to reuse a hash table
entry that's been reused, but we avoid it when we happen to snag a
hash table entry that hasn't been reused recently.  This is probably
less efficient at avoiding re-heaping than the previous approach, but
it avoids a separately-a

Re: [proposal] de-TOAST'ing using a iterator

2019-07-10 Thread Binguo Bao
This is the patch that fixes the warnings.

Best Regards,
Binguo Bao

Binguo Bao  于2019年7月10日周三 下午10:18写道:

> Hi Thomas,
> I've fixed the warnings.
>
> Thomas Munro  于2019年7月5日周五 下午12:21写道:
>
>> On Thu, Jun 20, 2019 at 1:51 AM Binguo Bao  wrote:
>> > Hi hackers!
>> > This proposal aims to provide the ability to de-TOAST a fully TOAST'd
>> and compressed field using an iterator and then update the appropriate
>> parts of the code to use the iterator where possible instead of
>> de-TOAST'ing and de-compressing the entire value. Examples where this can
>> be helpful include using position() from the beginning of the value, or
>> doing a pattern or substring match.
>> >
>> > de-TOAST iterator overview:
>> > 1. The caller requests the slice of the attribute value from the
>> de-TOAST iterator.
>> > 2. The de-TOAST iterator checks if there is a slice available in the
>> output buffer, if there is, return the result directly,
>> > otherwise go to step 3.
>> > 3. The de-TOAST iterator checks if there is a slice available in the
>> input buffer, if there is, go to step 4. Otherwise,
>> > call fetch_datum_iterator to fetch datums from disk to input buffer.
>> > 4. If the data in the input buffer is compressed, extract some data
>> from the input buffer to the output buffer until the caller's
>> > needs are met.
>> >
>> > I've implemented the prototype and apply it to the position() function
>> to test performance.
>>
>> Hi Binguo,
>>
>> Interesting work, and nice performance improvements so far.  Just by
>> the way, the patch currently generates warnings:
>>
>> https://travis-ci.org/postgresql-cfbot/postgresql/builds/554345719
>>
>> --
>> Thomas Munro
>> https://enterprisedb.com
>>
>
From ea2eb65049e2da984f0a2c3475f7f4f92adc98ab Mon Sep 17 00:00:00 2001
From: BBG 
Date: Tue, 4 Jun 2019 22:56:42 +0800
Subject: [PATCH] de-TOASTing using a iterator

---
 src/backend/access/heap/tuptoaster.c | 491 +++
 src/backend/utils/adt/varlena.c  |  48 ++--
 src/include/access/tuptoaster.h  |  92 +++
 3 files changed, 615 insertions(+), 16 deletions(-)

diff --git a/src/backend/access/heap/tuptoaster.c b/src/backend/access/heap/tuptoaster.c
index 55d6e91..8f7faf6 100644
--- a/src/backend/access/heap/tuptoaster.c
+++ b/src/backend/access/heap/tuptoaster.c
@@ -83,6 +83,13 @@ static int	toast_open_indexes(Relation toastrel,
 static void toast_close_indexes(Relation *toastidxs, int num_indexes,
 LOCKMODE lock);
 static void init_toast_snapshot(Snapshot toast_snapshot);
+static FetchDatumIterator create_fetch_datum_iterator(struct varlena *attr);
+static bool free_fetch_datum_iterator(FetchDatumIterator iter);
+static int32 fetch_datum_iterate(FetchDatumIterator iter);
+static void init_toast_buffer(ToastBuffer *buf, int size, bool compressed);
+static bool free_toast_buffer(ToastBuffer *buf);
+static int32 pglz_decompress_iterate(ToastBuffer *source, ToastBuffer *dest,
+	 DetoastIterator iter, int32 length);
 
 
 /* --
@@ -347,6 +354,145 @@ heap_tuple_untoast_attr_slice(struct varlena *attr,
 
 
 /* --
+ * create_detoast_iterator -
+ *
+ * Initialize detoast iterator.
+ * --
+ */
+DetoastIterator create_detoast_iterator(struct varlena *attr) {
+	struct varatt_external toast_pointer;
+	DetoastIterator iterator = NULL;
+	if (VARATT_IS_EXTERNAL_ONDISK(attr))
+	{
+		/*
+		 * This is an externally stored datum --- create fetch datum iterator
+		 */
+		iterator = (DetoastIterator) palloc0(sizeof(DetoastIteratorData));
+		iterator->fetch_datum_iterator = create_fetch_datum_iterator(attr);
+		VARATT_EXTERNAL_GET_POINTER(toast_pointer, attr);
+		if (VARATT_EXTERNAL_IS_COMPRESSED(toast_pointer))
+		{
+			/* If it's compressed, prepare buffer for raw data */
+			iterator->buf = (ToastBuffer *) palloc0(sizeof(ToastBuffer));
+			init_toast_buffer(iterator->buf, toast_pointer.va_rawsize, false);
+			iterator->source = NULL;
+			iterator->ctrlc = 0;
+			iterator->compressed = true;
+			iterator->done = false;
+		}
+		else
+		{
+			iterator->buf = iterator->fetch_datum_iterator->buf;
+			iterator->source = NULL;
+			iterator->ctrlc = 0;
+			iterator->compressed = false;
+			iterator->done = false;
+		}
+	}
+	else if (VARATT_IS_EXTERNAL_INDIRECT(attr))
+	{
+		/*
+		 * This is an indirect pointer --- dereference it
+		 */
+		struct varatt_indirect redirect;
+
+		VARATT_EXTERNAL_GET_POINTER(redirect, attr);
+		attr = (struct varlena *) redirect.pointer;
+
+		/* nested indirect Datums aren't allowed */
+		Assert(!VARATT_IS_EXTERNAL_INDIRECT(attr));
+
+		/* recurse in case value is still extended in some other way */
+		iterator = create_detoast_iterator(attr);
+
+	}
+	else if (VARATT_IS_COMPRESSED(attr))
+	{
+		/*
+		 * This is a compressed value inside of the main tuple
+		 */
+		iterator = (DetoastIterator) palloc0(sizeof(DetoastIteratorData));
+		iterator->fetch_datum_iterator = NULL;
+		iterator->source = palloc0(sizeof(ToastBuffer));
+		iterator->source->buf 

buildfarm's typedefs list has gone completely nutso

2019-07-10 Thread Tom Lane
The current HEAD typedefs list available from
https://buildfarm.postgresql.org/cgi-bin/typedefs.pl
has the following interesting additions compared to where
things were on July 1:

2
ECPGt_bytea
connection_name
in_addr
pg_fprintf
send_appname

The "2" in particular is causing seriously bad pgindent results for me.
But as far as I can tell, none of these have any justification being
marked as a typedef.

calliphoridae seems to be contributing the "2" and "pg_fprintf".
I didn't track down the rest (but calliphoridae is not to blame).

Was there any change in calliphoridae's toolchain this month?

regards, tom lane




Re: (select query)/relation as first class citizen

2019-07-10 Thread Dent John
Hi Roman, Pavel,

I was interested in this post, as it’s a topic I’ve stumbled upon in the past. 

There are two topics at play here:

1. The ability to flexibly craft queries from procedural language functions

2. Support for pipelined access to SETOF/TABLEs from procedural language 
functions

Postgres has parameterised functions that can return a query and optimise it in 
context of the overall statement, but only for the SQL language. Absence of 
support for other languages is a curious gap.

And it leads to fact that, even in presence of only static parameters, only one 
“shape” of query can ever be returned. (No IF/ELSE IF logic can alter the 
query, unless you dive down the rat hole of encoding it into the query itself.) 
So that is another gap.

Postgres has some relevant first class types: TABLE/SETOF and REFCURSOR. 
TABLE/SETOF are output only, materialised always and optimiser fences. Current 
syntax supports pipelined output (via RETURN NEXT), and docs call out the fact 
that it might in future not be materialised. I suspect an executor change is 
needed to support it, as well as plpgsql change. 

Their output-only nature is an odd gap. REFCURSOR is not materialised, and is 
also input-capable. If SETOF/TABLE were made both, then there would be a 
curious type system duplication. 

However REFCURSOR is pretty awkward to use from SQL. The fact you can’t cast or 
convert it to a SETOF/TABLE and SELECT FROM a REFCURSOR in native SQL is weird, 
and a gap, IMHO.

On the input side, REFCURSOR is neat. Despite the above limitation, it can 
become bound to a query before being OPENed for execution and fetching. If only 
the optimiser could “see” that pre-OPENed state, as with parameterised views, 
then, in principle, there would be nothing stopping some other outer function 
consuming it, SELECTing FROM it, and perhaps even returning a new query, and 
then the optimiser would be able to see and optimise the final global 
statement. Okay: this is a biggie, but it’s still a gap, in my view. 

So my view is that Postgres already has types that are close to what is asked 
for. It also has tools that look ripe to be plumbed together. Problem is, when 
they are combined, they don’t fit well, and when they are made to fit, the 
optimiser fence, always-materialised behaviour and curious output-only nature 
lead developers to create un-performant messes. :-)

I think some of this could be fixed quite easily. The executor already 
(obviously) can pipeline. PLs can’t today save and restore their context to 
support pipelining, but it is not impossible. REFCURSOR can’t be cast to a 
TABLE/SETOF, nor meaningfully be SELECTed FROM, but that can’t be too hard 
either.

Exposing the pre-OPENed query for optimisation is another thing. But here 
again, I see it as a challenge of mental gymnastics rather than actually hard 
in terms of code factoring — much of what is needed is surely already there in 
the way of VIEW rewriting. 

Regarding demand for the #2 feature set, this somewhat dated thread is 
suggestive of a niche use case: 
https://www.postgresql.org/message-id/flat/005701c6dc2c%2449011fc0%240a00a8c0%40trivadis.com.

d.

> On 8 Jul 2019, at 10:19, Pavel Stehule  wrote:
> 
> Hi
> 
> On Mon, Jul 8, 2019 at 9:33 AM, Roman Pekar  wrote:
>> Hi,
>> 
>> what do you think about this idea in general? If you don't have to think 
>> about implementation for now? From my point of view writing Sql queries is 
>> very close to how functional language work if you treat "select" queries as 
>> functions without side-effects, and having query being first-class-citizen 
>> could move this even further.
> 
> first - please, don't send top posts.
> 
> second - my opinion is not settled. I can imagine benefits - on the other 
> hand, the usage is rather close to a known antipattern - a single query 
> wrapped in a function. I see your proposal as slightly more dynamic views 
> (with slightly different syntax). 
> 
> From my experience I am really afraid of it - it can be very effective (from 
> the developer's perspective) and very slow (from the customer's perspective). 
> It is an example of a tool that looks nice on paper but can be used very 
> badly.
> 
> Maybe I am not the best man for this topic - I like some functional 
> programming concepts, but I use them locally - your proposal moves SQL into 
> some unexplored areas - and I think it could be interesting as a real 
> research topic, but it is not a theme for today's Postgres.
> 
> The basic question is why to extend SQL rather than use some native 
> functional language. Postgres should implement ANSI SQL - and there is not 
> much space for big experiments. I am sceptical about it - relational 
> databases are static, SQL is a static language, so it is hard to implement a 
> dynamic system on top of it. SQL is a language over relational algebra - it 
> is not a functional language, and I am afraid that introducing another 
> concept would do more bad than good.
> 
> Regards
> 
> Pavel
>  
> 
> 
>> 
>> Regards,
>> Roman
>> 
>>> On 

Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-10 Thread Alvaro Herrera
On 2019-Jul-10, Bruce Momjian wrote:

> *  Using the LSN as part of the nonce fixes both problems, and has a
>third benefit:
> 
> *  We don't need to decrypt/re-encrypt during CREATE DATABASE since
>the page contents are the same in both places, and once one
>database changes its pages, it gets a new LSN, and hence a new
>nonce/IV.
> 
> *  For each change of an 8k page, you get a new nonce/IV, so you
>are not encrypting different data with the same nonce/IV
> 
> *  This avoids requiring pg_upgrade to preserve database oids.

An ignorant question -- what is it that gets stored in the page for
decryption use, the nonce or the IV derived from it?  I think if you
want to store the nonce, you'd have to store the database OID, because
otherwise how do you know which database OID to use to determine the
full nonce after cloning a database?  You already have the table OID in
the catalog and the LSN in the page header, so you're only missing the
database OID.  (Assuming you make the nonce be database OID || relation
OID || page LSN)

Also, how are key changes handled?  Do we need to store a key identifier
in each page?

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: buildfarm's typedefs list has gone completely nutso

2019-07-10 Thread Andres Freund
Hi,

On 2019-07-10 12:57:08 -0400, Tom Lane wrote:
> The current HEAD typedefs list available from
> https://buildfarm.postgresql.org/cgi-bin/typedefs.pl
> has the following interesting additions compared to where
> things were on July 1:
> 
>   2
>   ECPGt_bytea
>   connection_name
>   in_addr
>   pg_fprintf
>   send_appname

Huh.


> The "2" in particular is causing seriously bad pgindent results for
> me.

I haven't run pgindent, but I certainly can imagine...


> But as far as I can tell, none of these have any justification being
> marked as a typedef.
> 
> calliphoridae seems to be contributing the "2" and "pg_fprintf".
> I didn't track down the rest (but calliphoridae is not to blame).

> Was there any change in calliphoridae's toolchain this month?

Hm, it has gotten gcc-9 installed recently, but calliphoridae isn't
using that. So it's probably not the compiler side. But I also see a
binutils upgrade:

2019-07-08 06:22:48 upgrade binutils-multiarch:amd64 2.31.1-16 
2.32.51.20190707-1

and corresponding upgrades for all the arch-specific packages. I suspect
it might be that.

I can't immediately reproduce that locally though, using the same
version of binutils. It's somewhat annoying that the buildfarm uses a
different form of computing the typedefs than src/tools/find_typedef ...

Greetings,

Andres Freund




Re: buildfarm's typedefs list has gone completely nutso

2019-07-10 Thread Tom Lane
Andres Freund  writes:
> On 2019-07-10 12:57:08 -0400, Tom Lane wrote:
>> Was there any change in calliphoridae's toolchain this month?

> Hm, it has gotten gcc-9 installed recently, but calliphoridae isn't
> using that. So it's probably not the compiler side. But I also see a
> binutils upgrade:
> 2019-07-08 06:22:48 upgrade binutils-multiarch:amd64 2.31.1-16 
> 2.32.51.20190707-1
> and corresponding upgrades for all the arch-specific packages. I suspect
> it might be that.

Yeah, a plausible theory is that the output format changed enough
to confuse our typedef-symbol-scraping code.

regards, tom lane




Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-10 Thread Bruce Momjian
On Wed, Jul 10, 2019 at 01:04:47PM -0400, Alvaro Herrera wrote:
> On 2019-Jul-10, Bruce Momjian wrote:
> 
> > *  Using the LSN as part of the nonce fixes both problems, and has a
> >third benefit:
> > 
> > *  We don't need to decrypt/re-encrypt during CREATE DATABASE since
> >the page contents are the same in both places, and once one
> >database changes its pages, it gets a new LSN, and hence a new
> >nonce/IV.
> > 
> > *  For each change of an 8k page, you get a new nonce/IV, so you
> >are not encrypting different data with the same nonce/IV
> > 
> > *  This avoids requiring pg_upgrade to preserve database oids.
> 
> An ignorant question -- what is it that gets stored in the page for
> decryption use, the nonce or the IV derived from it?  I think if you
> want to store the nonce, you'd have to store the database OID, because
> otherwise how do you know which database OID to use to determine the
> full nonce after cloning a database?  You already have the table OID in
> the catalog and the LSN in the page header, so you're only missing the
> database OID.  (Assuming you make the nonce be database OID || relation
> OID || page LSN)

You are right that if you used the database oid in the nonce, you would
need to decrypt/re-encrypt the data during CREATE DATABASE, or store
the original database oid in the page.

The new approach is that a single key would be used for all databases
and the WAL, and use the LSN instead of the database oid, so there is no
need to know which database originally encrypted the page --- any
database can decrypt it.

> Also, how are key changes handled?  Do we need to store a key identifier
> in each page?

Uh, we have not started discussing that yet.  I am thinking we might
need to store the key identifier in the pg_class table and then create a
command to re-encrypt tables.  We can re-key a page at a time, but we
would still need to know when all pages/tables are no longer using the
old key, so doing it just at the table/index level seems appropriate.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: block-level incremental backup

2019-07-10 Thread Anastasia Lubennikova

23.04.2019 14:08, Anastasia Lubennikova wrote:

I'm volunteering to write a draft patch or, more likely, a set of
patches, which will allow us to discuss the subject in more detail.
And to do that I wish we agree on the API and data format (at least
broadly). Looking forward to hearing your thoughts.


Though the previous discussion stalled, I still hope that we can agree
on basic points such as a map file format and protocol extension, which
is necessary to start implementing the feature.

- Proof Of Concept patch -

In the attachments, you can find a prototype of incremental pg_basebackup,
which consists of 2 features:

1) To perform an incremental backup, one should call pg_basebackup with a
new argument:

pg_basebackup -D 'basedir' --prev-backup-start-lsn 'lsn'

where lsn is the start_lsn of the parent backup (it can be found in the
"backup_label" file).

It calls the BASE_BACKUP replication command with a new argument,
PREV_BACKUP_START_LSN 'lsn'.

For datafiles, only pages with LSN > prev_backup_start_lsn will be included
in the backup. They are saved into a 'filename.partial' file, while a
'filename.blockmap' file contains an array of BlockNumbers. For example, if
we backed up blocks 1, 3, and 5, filename.partial will contain those 3
blocks, and 'filename.blockmap' will contain the array {1,3,5}.

Non-datafiles use the same format as before.
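As a worked illustration of that format, a merge tool could apply a
'.partial' plus '.blockmap' pair to a base file roughly as follows (a hedged
sketch in plain C, based only on the description above; the block size and
on-disk layout details are assumptions, not taken from the patch):

#include <stdint.h>
#include <stdio.h>

#define BLCKSZ 8192             /* assumed page size */

typedef uint32_t BlockNumber;

/*
 * Overwrite the base file's blocks listed in the block map with the blocks
 * stored back-to-back in the .partial file.  With map = {1,3,5}, the i-th
 * stored block lands at block number map[i] of the base file.
 */
static int
apply_partial(FILE *base, FILE *partial,
              const BlockNumber *map, int nblocks)
{
    char        buf[BLCKSZ];

    for (int i = 0; i < nblocks; i++)
    {
        if (fread(buf, BLCKSZ, 1, partial) != 1)
            return -1;          /* truncated .partial file */
        if (fseek(base, (long) map[i] * BLCKSZ, SEEK_SET) != 0)
            return -1;
        if (fwrite(buf, BLCKSZ, 1, base) != 1)
            return -1;
    }
    return 0;
}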

2) To merge an incremental backup into a full backup, call

pg_basebackup -D 'basedir' --incremental-pgdata 'incremental_basedir'
--merge-backups

It will move all files from 'incremental_basedir' to 'basedir', handling
'.partial' files correctly.


- Questions to discuss -

Please note that it is just a proof-of-concept patch and it can be
optimized in many ways. Let's concentrate on issues that affect the
protocol or data format.

1) Whether we collect block maps using the simple "read everything page by
page" approach, WAL scanning, or any other page tracking algorithm, we must
choose a map format.

I implemented the simplest one, while there are more ideas:

- We can have a map not per file, but per relation or maybe per tablespace,
which will make the implementation more complex, but probably more optimal.
The only problem I see with the existing implementation is that even if
only a few blocks changed, we still must pad the map to 512 bytes per tar
format requirements.

- We can save LSNs into the block map:

typedef struct BlockMapItem {
    BlockNumber blkno;
    XLogRecPtr lsn;
} BlockMapItem;

In my implementation, an invalid prev_backup_start_lsn means falling back
to a regular basebackup without any block maps. Alternatively, we can
define another meaning for this value and send a block map for all files.

Backup utilities can use these maps to speed up backup merge or restore.

2) We can implement a BASE_BACKUP SEND_FILELIST replication command, which
will return a list of filenames with file sizes, and block maps if an lsn
was provided.

To avoid changing the format, we can simply send tar headers for each file:
- tarHeader("filename.blockmap") followed by the blockmap, for relation
files if prev_backup_start_lsn is provided;
- tarHeader("filename") without the actual file content, for non-relation
files or for all files in a "FULL" backup.

The caller can parse these messages and use them for any purpose, for
example, to perform a parallel backup.


Thoughts?

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 13e0d23..e757bba 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -10459,7 +10459,7 @@ do_pg_start_backup(const char *backupidstr, bool fast, TimeLineID *starttli_p,
 			ti->oid = pstrdup(de->d_name);
 			ti->path = pstrdup(buflinkpath.data);
 			ti->rpath = relpath ? pstrdup(relpath) : NULL;
-			ti->size = infotbssize ? sendTablespace(fullpath, true) : -1;
+			ti->size = infotbssize ? sendTablespace(fullpath, true, InvalidXLogRecPtr) : -1;
 
 			if (tablespaces)
 *tablespaces = lappend(*tablespaces, ti);
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index c2978a9..3560da1 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -41,6 +41,7 @@
 #include "utils/ps_status.h"
 #include "utils/relcache.h"
 #include "utils/timestamp.h"
+#include "utils/pg_lsn.h"
 
 
 typedef struct
@@ -52,13 +53,22 @@ typedef struct
 	bool		includewal;
 	uint32		maxrate;
 	bool		sendtblspcmapfile;
+	XLogRecPtr	prev_backup_start_lsn;
 } basebackup_options;
 
 
 static int64 sendDir(const char *path, int basepathlen, bool sizeonly,
-	 List *tablespaces, bool sendtblspclinks);
+	 List *tablespaces, bool sendtblspclinks, XLogRecPtr prev_backup_start_lsn);
 static bool sendFile(const char *readfilename, const char *tarfilename,
-	 struct stat *statbuf, bool missing_ok, Oid dboid);
+	 struct stat *statbuf, bool missing_ok, Oid dboid

Re: doc: minor update for description of "pg_roles" view

2019-07-10 Thread Bruce Momjian
On Wed, Jul 10, 2019 at 02:35:56PM +0900, Ian Barwick wrote:
> Hi
> 
> Here:
> 
>   https://www.postgresql.org/docs/12/view-pg-roles.html
> 
> we state:
> 
>   "This view explicitly exposes the OID column of the underlying table,
>since that is needed to do joins to other catalogs."
> 
> I think it's superfluous to mention this now OIDs are exposed by default;
> attached patch (for REL_12_STABLE and HEAD) simply removes this sentence.

Patch applied through PG 12.  Thanks.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-10 Thread Ryan Lambert
> what is it that gets stored in the page for
> decryption use, the nonce or the IV derived from it?


I believe storing the IV is preferable and still secure per [1]: "The IV
need not be secret"

Beyond needing the database oid, if every decrypt function has to
regenerate the IV from the nonce that will affect performance.  I don't
know how expensive the forward hash is but it won't be free.

[1]
https://nvlpubs.nist.gov/nistpubs/Legacy/SP/nistspecialpublication800-38a.pdf



*Ryan Lambert*


Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-10 Thread Stephen Frost
Greetings,

* Ryan Lambert (r...@rustprooflabs.com) wrote:
> > what is it that gets stored in the page for
> > decryption use, the nonce or the IV derived from it?
> 
> I believe storing the IV is preferable and still secure per [1]: "The IV
> need not be secret"
> 
> Beyond needing the database oid, if every decrypt function has to
> regenerate the IV from the nonce that will affect performance.  I don't
> know how expensive the forward hash is but it won't be free.

Compared to the syscall and possible disk i/o required, I'm not sure
that's something we really need to try to optimize for, particularly if
we could store something more generally useful (like the LSN) in that
little bit of space that's available in each page.

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-10 Thread Bruce Momjian
On Wed, Jul 10, 2019 at 12:38:02PM -0600, Ryan Lambert wrote:
> 
> what is it that gets stored in the page for
> decryption use, the nonce or the IV derived from it?
> 
> 
> I believe storing the IV is preferable and still secure per [1]: "The IV need
> not be secret"
> 
> Beyond needing the database oid, if every decrypt function has to regenerate
> the IV from the nonce that will affect performance.  I don't know how 
> expensive
> the forward hash is but it won't be free.

Well, I think we have three options.  We have 3 4-byte integers
(pg_class.oid, LSN, page-number) that could be concatenated to be the
IV, we could run those through a hash, or we could run them through the
encryption function with the secret.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-10 Thread Stephen Frost
Greetings,

* Bruce Momjian (br...@momjian.us) wrote:
> On Wed, Jul 10, 2019 at 12:38:02PM -0600, Ryan Lambert wrote:
> > 
> > what is it that gets stored in the page for
> > decryption use, the nonce or the IV derived from it?
> > 
> > 
> > I believe storing the IV is preferable and still secure per [1]: "The IV 
> > need
> > not be secret"
> > 
> > Beyond needing the database oid, if every decrypt function has to regenerate
> > the IV from the nonce that will affect performance.  I don't know how 
> > expensive
> > the forward hash is but it won't be free.
> 
> Well, I think we have three options.  We have 3 4-byte integers
> (pg_class.oid, LSN, page-number) that could be concatenated to be the
> IV, we could run those through a hash, or we could run them through the
> encryption function with the secret.

I didn't see where it was said that using a hash was a good idea in this
context..?  Encrypting it with the key looked like it was discussed as a
viable option.  I had understood that part of the point of using the
table OID and page-number was also so that we didn't have to explicitly
store the result, therefore requiring us to need less space on the page
to make this happen.

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: Excessive memory usage in multi-statement queries w/ partitioning

2019-07-10 Thread Tom Lane
Amit Langote  writes:
> Attached updated patch.  Thanks again.

Pushed with a bit of further cleanup --- most notably, the way
you had execute_sql_string(), it was still leaking any cruft
ProcessUtility might generate.  We can fix that by running
ProcessUtility in the per-statement context too.

I also dropped the optimization for a single/last statement in
execute_sql_string(), and simplified it to just always create
and delete a child context.  This was based on a couple of
thoughts.  The norm in this code path is that there's multiple
statements, probably lots of them, so that the percentage savings
from getting rid of one context creation is likely negligible.
Also, unlike exec_simple_query, we *don't* know that the outer
context is due to be cleared right afterwards.  Since
execute_sql_string() can run multiple times in one extension
command, in principle we could get bloat from not cleaning up
after the last command of each string.   Admittedly, it's not
likely that you'd have so many strings involved that that
amounts to a lot, but between following upgrade-script chains
and cascaded module loads, there could be more than a couple.
So it seems like the argument for saving a context creation is
much weaker here than in exec_simple_query.
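For reference, the per-statement arrangement being described is the usual
PostgreSQL memory-context idiom; roughly (a sketch of the idiom only, not
the committed change):

/*
 * Sketch of the per-statement context idiom: everything a statement
 * allocates dies with its context, so nothing accumulates across the
 * many statements in the string.
 */
ListCell   *lc;

foreach(lc, raw_parsetree_list)
{
    MemoryContext per_stmt_ctx;
    MemoryContext oldcontext;

    per_stmt_ctx = AllocSetContextCreate(CurrentMemoryContext,
                                         "per-statement context",
                                         ALLOCSET_DEFAULT_SIZES);
    oldcontext = MemoryContextSwitchTo(per_stmt_ctx);

    /* ... parse analysis, planning, execution/ProcessUtility here ... */

    MemoryContextSwitchTo(oldcontext);
    MemoryContextDelete(per_stmt_ctx);
}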

I tried to improve the comments too.  I noticed that the bit about
"(again, these must outlive the execution context)" seemed to be
a dangling reference --- whatever previous comment it was referring
to is not to be found anymore.  So I made that self-contained.

regards, tom lane




Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-10 Thread Bruce Momjian
On Wed, Jul 10, 2019 at 02:44:30PM -0400, Stephen Frost wrote:
> Greetings,
> 
> * Bruce Momjian (br...@momjian.us) wrote:
> > On Wed, Jul 10, 2019 at 12:38:02PM -0600, Ryan Lambert wrote:
> > > 
> > > what is it that gets stored in the page for
> > > decryption use, the nonce or the IV derived from it?
> > > 
> > > 
> > > I believe storing the IV is preferable and still secure per [1]: "The IV 
> > > need
> > > not be secret"
> > > 
> > > Beyond needing the database oid, if every decrypt function has to 
> > > regenerate
> > > the IV from the nonce that will affect performance.  I don't know how 
> > > expensive
> > > the forward hash is but it won't be free.
> > 
> > Well, I think we have three options.  We have 3 4-byte integers
> > (pg_class.oid, LSN, page-number) that could be concatenated to be the
> > IV, we could run those through a hash, or we could run them through the
> > encryption function with the secret.
> 
> I didn't see where it was said that using a hash was a good idea in this
> context..?  Encrypting it with the key looked like it was discussed as a

I didn't either, except it was referenced above as "forward hash".  I
don't know why that was suggested, which is why I listed it as an
option/suggestion.

> viable option.  I had understood that part of the point of using the

Agreed.

> table OID and page-number was also so that we didn't have to explicitly
> store the result, therefore requiring us to need less space on the page
> to make this happen.

Yep!

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-10 Thread Ryan Lambert
> I didn't either, except it was referenced above as "forward hash".  I
> don't know why that was suggested, which is why I listed it as an
> option/suggestion.

My bad, sorry for the confusion!  I meant to say "cipher" not "hash".  I
was (trying to) refer to the method of generating unpredictable IV from
nonces using the forward cipher function and the encryption key.
Too many closely related words with very specific meanings.

Ryan





Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-10 Thread Stephen Frost
Greetings,

* Ryan Lambert (r...@rustprooflabs.com) wrote:
> > I didn't either, except it was referenced above as "forward hash".  I
> > don't know why that was suggested, which is why I listed it as an
> > option/suggestion.
> 
> My bad, sorry for the confusion!  I meant to say "cipher" not "hash".  I
> was (trying to) refer to the method of generating unpredictable IV from
> nonces using the forward cipher function and the encryption key.
> Too many closely related words with very specific meanings.

No worries, just want to try and be clear on these things..  Too easy to
mistakenly think that doing this very-similar-thing will be as secure as
doing the recommended-thing (particularly when the recommended-thing is
a lot harder...), and we don't want to end up doing that and then
discovering it isn't actually secure..

Thanks!

Stephen


signature.asc
Description: PGP signature


Re: POC: Cleaning up orphaned files using undo logs

2019-07-10 Thread Robert Haas
On Tue, Jul 9, 2019 at 6:28 AM Dilip Kumar  wrote:
> PFA, updated patch version which includes
> - One defect fix in undo interface related to undo page compression
> for handling persistence level
> - Implemented pending TODO optimization in undo page compression.
> - One defect fix in undo processing related to the prepared transaction

Looking at 0002 a bit, it seems to me that you really need to spend
some energy getting things into a consistent order all across the
patch.  For example, UndoPackStage uses the ordering: HEADER,
TRANSACTION, RMID, RELOID, XID, CID...  But the declarations of the
UREC_INFO constants go in a different order: TRANSACTION, FORK, BLOCK,
BLKPREV... The comments defining those go in a different order and
some of them are missing. The definition of the UndoRecordBlah
structures go in a different order still: Transaction, Block,
LogSwitch, Payload.  UndoRecordHeaderSize goes with FORK, BLOCK,
BLKPREV, TRANSACTION, LOGSWITCH, ...  That really needs to be
straightened out and made consistent.

You (still) need to rename blkprev to something more generic, as
mentioned in previous rounds of review.

I think it would be a good idea to avoid complex macros in favor of
functions where possible, e.g. UNDO_PAGE_PARTIAL_REC_SIZE.  If
performance is a concern, it could be declared static inline, which
should be as good as a macro.
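For illustration, that conversion looks like this (a generic sketch; the
macro body and the PageHdrStandIn type here are stand-ins, not the patch's
actual definition of UNDO_PAGE_PARTIAL_REC_SIZE):

typedef struct PageHdrStandIn
{
    int         first_record_offset;
    int         header_size;
} PageHdrStandIn;

/* Before: logic buried in a macro (stand-in body) */
#define PARTIAL_REC_SIZE(hdr) \
    ((hdr)->first_record_offset - (hdr)->header_size)

/*
 * After: the same computation as a static inline function, which is
 * type-checked and debuggable, and equally cheap once inlined.
 */
static inline int
partial_rec_size(const PageHdrStandIn *hdr)
{
    return hdr->first_record_offset - hdr->header_size;
}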

I don't like the fact that undoaccess.c has a new global,
undo_compression_info.  I haven't read the code thoroughly, but do we
really need that?  I think it's never modified (so it could just be
declared const), and I also think it's just all zeroes (so
initializing it isn't really necessary), and I also think that it's
just used for initializing other UndoCompressionInfos (so we could
just initialize them directly, either by setting the members
individually or just zeroing them).

It seems like UndoRecordPrepareTransInfo ought to have an Assert(index
< some_limit) in the loop.

A comment in PrepareUndoInsert refers to "low switch" where it means
"log switch."

This is by no means a complete review, for which I unfortunately lack
the time at present.  Just some initial observations.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: pg_receivewal documentation

2019-07-10 Thread Laurenz Albe
On Wed, 2019-07-10 at 17:04 +0900, Michael Paquier wrote:
> Hmm.  synchronous_commit is user-settable, which means that it is
> possible to enforce a value in the connection string doing the
> connection.  Isn't that something we had better enforce directly in
> the tool?  In this case what could be fixed is GetConnection() which
> builds the connection string parameters.

I don't follow.

Are you talking about the replication connection from pg_receivewal
to the PostgreSQL server?  That wouldn't do anything, because it is
the setting of "synchronous_commit" for an independent client
connection that is the problem:

- pg_receivewal starts a replication connection.

- It is added to "synchronous_standby_names" on the server.

- A client connects. It sets "synchronous_commit" to "remote_apply".

- If the client modifies data, COMMIT will hang indefinitely,
  because pg_receivewal will never send confirmation that it has
  applied the changes.

One alternative option I see is for pg_receivewal to confirm that
it has applied the changes as soon as it has flushed them.
It would be cheating somewhat, but it would work around the problem
in a way that few people would find surprising.
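In protocol terms, that workaround amounts to reporting the flush position
as the apply position in the Standby Status Update message. A hedged sketch
of what that would look like (not pg_receivewal's actual code; the message
layout follows the documented streaming replication protocol):

#include <libpq-fe.h>
#include <stdint.h>

/* Write a 64-bit value in network byte order. */
static void
send_int64(uint64_t val, char *buf)
{
    for (int i = 0; i < 8; i++)
        buf[i] = (char) (val >> (56 - 8 * i));
}

/*
 * Standby Status Update ('r'), reporting the flushed LSN as both
 * flushed *and* applied.
 */
static int
send_feedback(PGconn *conn, uint64_t written_lsn, uint64_t flushed_lsn,
              int64_t client_now)
{
    char        buf[1 + 8 + 8 + 8 + 8 + 1];
    int         off = 0;

    buf[off++] = 'r';
    send_int64(written_lsn, &buf[off]);             /* written */
    off += 8;
    send_int64(flushed_lsn, &buf[off]);             /* flushed */
    off += 8;
    send_int64(flushed_lsn, &buf[off]);             /* "applied" = flushed */
    off += 8;
    send_int64((uint64_t) client_now, &buf[off]);   /* client clock */
    off += 8;
    buf[off++] = 0;                                 /* no reply requested */

    if (PQputCopyData(conn, buf, off) <= 0 || PQflush(conn) != 0)
        return -1;
    return 0;
}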

Yours,
Laurenz Albe





Re: Add parallelism and glibc dependent only options to reindexdb

2019-07-10 Thread Julien Rouhaud
Hi Alvaro,

Thanks a lot for the review

On Wed, Jul 10, 2019 at 4:15 PM Alvaro Herrera  wrote:
>
> On 2019-Jul-09, Julien Rouhaud wrote:
>
> > I finished to do a better refactoring, and ended up with this API in
> > scripts_parallel:
>
> Looking good!

Thanks!

>  I'm not sure about the "Consume" word in ConsumeIdleSlot;
> maybe "Reserve"? "Obtain"? "Get"?

Yes, Consume is maybe a little bit weird.  I wanted to make it clear
that this function is actually removing a slot from the free list,
especially since there's a (historical) get_idle_slot().  I like
Reserve, but Obtain and Get are probably too ambiguous.

> Code commentary: I think the comment that sits atop the function should
> describe what the function does without getting too much in how it does
> it.  For example in ConsumeIdleSlot you have "If there are multiples
> slots, here we wait for one connection to become available if none
> already is, returning NULL if an error occured.  Otherwise, we simply
> use the only slot we have, which we know to be free." which seems like
> it should be in another comment *inside* the function; make the external
> one something like "Reserve and return a connection that is currently
> idle, waiting until one becomes idle if none is".  Maybe you can put the
> part I first quoted as a second paragraph in the comment at top of
> function and keeping the second part I quoted as first paragraph; we
> seem to use that style too.

Good point, I'll fix as you say.

> Placement: I think it's good if related functions stay together, or
> there is some other rationale for placement within the file.  I have two
> favorite approaches: one is to put all externally callable functions at
> top of file, followed by all the static helpers in the lower half of the
> file.  The other is to put each externally accessible immediately
> followed by its specific static helpers.  If you choose one of those,
> that means that SetupParallelSlots should either move upwards, or move
> downwards.  The current ordering seems a dartboard kind of thing where
> the thrower is not Green Arrow.

:)  I tried to put everything in alphabetic order as it was previously
being done, but I apparently failed again at sorting more than 3
characters.

I usually prefer to group exported functions together and static ones
together, as I find it more maintainable in the long term, so upwards
it'll be.




Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-10 Thread Bruce Momjian
On Wed, Jul 10, 2019 at 02:57:54PM -0400, Stephen Frost wrote:
> Greetings,
> 
> * Ryan Lambert (r...@rustprooflabs.com) wrote:
> > > I didn't either, except it was referenced above as "forward hash".  I
> > > don't know why that was suggested, which is why I listed it as an
> > > option/suggestion.
> > 
> > My bad, sorry for the confusion!  I meant to say "cipher" not "hash".  I
> > was (trying to) refer to the method of generating unpredictable IV from
> > nonces using the forward cipher function and the encryption key.
> > Too many closely related words with very specific meanings.
> 
> No worries, just want to try and be clear on these things..  Too easy to
> mistakenly think that doing this very-similar-thing will be as secure as
> doing the recommended-thing (particularly when the recommended-thing is
> a lot harder...), and we don't want to end up doing that and then
> discovering it isn't actually secure..

Good, so I think we all now agree we have to put the nonce
(pg_class.oid, LSN, page-number) through the cipher using the secret.  I
think Stephen is right that the overhead of this will be minimal for 8k
page writes, and for WAL, we only need to generate the IV when we start
a new 16MB segment, so again, minimal overhead.
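A hedged sketch of that derivation using OpenSSL (the nonce field layout
and key handling here are illustrative assumptions, not a settled design):

#include <openssl/evp.h>
#include <stdint.h>
#include <string.h>

/*
 * Derive an unpredictable 16-byte IV by running the plain nonce
 * (relation OID, page number, LSN) through AES-256-ECB under the key.
 * Illustrative only: fields are copied in host byte order for simplicity.
 */
static int
derive_iv(const unsigned char key[32],
          uint32_t reloid, uint32_t blkno, uint64_t lsn,
          unsigned char iv[16])
{
    EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
    unsigned char nonce[16];
    int         outlen = 0;
    int         ok;

    memcpy(nonce, &reloid, 4);
    memcpy(nonce + 4, &blkno, 4);
    memcpy(nonce + 8, &lsn, 8);

    ok = ctx != NULL &&
        EVP_EncryptInit_ex(ctx, EVP_aes_256_ecb(), NULL, key, NULL) == 1 &&
        EVP_EncryptUpdate(ctx, iv, &outlen, nonce, 16) == 1 &&
        outlen == 16;

    EVP_CIPHER_CTX_free(ctx);
    return ok ? 0 : -1;
}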

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-10 Thread Alvaro Herrera
On 2019-Jul-10, Bruce Momjian wrote:

> Good, so I think we all now agree we have to put the nonce
> (pg_class.oid, LSN, page-number) through the cipher using the secret.

Actually, why do you need the page number in the nonce?  The LSN already
distinguishes pages -- you can't have two pages with the same LSN, can
you?  (I do think you can have multiple writes of the same page with
different LSNs, if you change hint bits and don't write WAL about it,
but maybe we should force CRC enabled in encrypted tables, which I think
closes this hole?)

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-10 Thread Stephen Frost
Greetings,

* Alvaro Herrera (alvhe...@2ndquadrant.com) wrote:
> On 2019-Jul-10, Bruce Momjian wrote:
> 
> > Good, so I think we all now agree we have to put the nonce
> > (pg_class.oid, LSN, page-number) through the cipher using the secret.
> 
> Actually, why do you need the page number in the nonce?  The LSN already
> distinguishes pages -- you can't have two pages with the same LSN, can
> you?  (I do think you can have multiple writes of the same page with
> different LSNs, if you change hint bits and don't write WAL about it,
> but maybe we should force CRC enabled in encrypted tables, which I think
> closes this hole?)

The point about the LSN not changing is definitely a very good one..  I
agree that we should require checksums to deal with that possibility.

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: POC: Cleaning up orphaned files using undo logs

2019-07-10 Thread Robert Haas
On Wed, Jul 10, 2019 at 12:36 PM Robert Haas  wrote:
> Broadly, you are correct to point out that you need to avoid chasing
> stale pointers, and there are a bunch of ways to accomplish that:
> approach #1 avoids using real pointers, and the rest just make sure
> that any stale pointers don't stick around long enough to cause any
> harm. There are probably also several other totally realistic
> alternatives, and I don't know for sure what is best, or how much it
> matters.

After some off-list discussion with Andres ...

Another possible approach here, which I think I like better, is to
switch from using a binary heap to using an rbtree.  That wouldn't
work well in DSM because of the way it uses pointers, but here we're
putting data in the shared memory segment so it seems like it should
work.  The idea would be to allocate an array of entries with a
freelist, and then have allocfunc and freefunc defined to push and pop
the freelist.  Unlike a binary heap, an rbtree lets us (a) do
peek-ahead in sorted order and (b) delete elements from an arbitrary
position without rebuilding anything.
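A sketch of that freelist arrangement (the entry layout and up-front sizing
are assumptions; RBTNode and the allocfunc/freefunc callback shapes, which
take an opaque arg, follow lib/rbtree.h):

/*
 * One preallocated entry; the RBTNode must be the first field so the tree
 * code can treat a pointer to the entry as a pointer to the node.
 */
typedef struct UndoEntry
{
    RBTNode     rbtnode;        /* must be first */
    struct UndoEntry *next_free;
    /* ... payload: xid, undo size, etc. ... */
} UndoEntry;

typedef struct EntryPool
{
    UndoEntry  *freelist;       /* head of the free list */
} EntryPool;

/* allocfunc: pop an entry from the freelist instead of palloc'ing */
static RBTNode *
pool_alloc(void *arg)
{
    EntryPool  *pool = (EntryPool *) arg;
    UndoEntry  *e = pool->freelist;

    pool->freelist = e->next_free;  /* capacity is sized up front */
    return &e->rbtnode;
}

/* freefunc: push the entry back onto the freelist */
static void
pool_free(RBTNode *node, void *arg)
{
    EntryPool  *pool = (EntryPool *) arg;
    UndoEntry  *e = (UndoEntry *) node;

    e->next_free = pool->freelist;
    pool->freelist = e;
}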

If we adopt this approach, then I think a bunch of the problems we've
been talking about actually get a lot easier.  If we pull an item from
the ordered-by-XID rbtree or the ordered-by-undo-size rbtree, we can
remove it from the other one cheaply, because we can store a pointer
to the RBTNode in the main object.  So then we never have any stale
pointers in any data structure, which means we don't have to have a
strategy to avoid accidentally following them.

The fact that we can peek ahead correctly without any new code is also
very nice.  I'm still concerned that peeking ahead isn't the right
approach in general, but if we're going to do it, peeking ahead to the
actually-next-highest-priority item is a lot better than peeking ahead
to some-item-that-may-be-fairly-high-priority.

One problem which Andres spotted is that rbt_delete() can actually
move content around, so if you just cache the RBTNode returned by
rbt_insert(), it might not be the right one by the time you
rbt_delete(), if other stuff has been deleted first.  There are
several possible approaches to that problem, but one that I'm
wondering about is modifying rbt_delete_node() so that it doesn't rely
on rbt_copy_data.  The idea is that if y != z, instead of copying the
data from y to z, copy the left/right/parent pointers from z into y,
and make z's left, right, and parent nodes point to y instead.  Then
we always end up removing the correct node, which would make things
much easier for us and might well be helpful to other code that uses
rbtree as well.

Another small problem, also spotted by Andres, is that rbt_create()
uses palloc.  That seems easy to work around: just provide an
rbt_initialize() function that a caller can use instead if it wants to
initialize an already-allocated block of memory.

Thoughts?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: coypu: "FATAL: sorry, too many clients already"

2019-07-10 Thread Rémi Zara



> Le 10 juil. 2019 à 04:09, Tom Lane  a écrit :
> 
> I don't see a really good reason to be using force_parallel_mode on
> such a low-end box, and would recommend taking that out in any case.
> If the box's SysV IPC limits can't be raised, it might be a good idea
> to restrict the maximum test parallelism.  For instance, I run
> prairiedog with
> 
>'build_env' => {
>'MAX_CONNECTIONS' => '3',
>   },
> 

Hi,

The difference in behavior might be explained by an upgraded OS and an
upgraded buildfarm script.
I’ll implement the suggested setting as soon as I can power-cycle the
machine, which I cannot contact right now.

Regards,

Rémi Zara





Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-10 Thread Bruce Momjian
On Wed, Jul 10, 2019 at 03:53:55PM -0400, Alvaro Herrera wrote:
> On 2019-Jul-10, Bruce Momjian wrote:
> 
> > Good, so I think we all now agree we have to put the nonce
> > (pg_class.oid, LSN, page-number) through the cipher using the secret.
> 
> Actually, why do you need the page number in the nonce?  The LSN already
> distinguishes pages -- you can't have two pages with the same LSN, can
> you?  (I do think you can have multiple writes of the same page with
> different LSNs, if you change hint bits and don't write WAL about it,
> but maybe we should force CRC enabled in encrypted tables, which I think
> closes this hole?)

Uh, what if a transaction modifies page 0 and page 1 of the same table
--- don't those pages have the same LSN?

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-10 Thread Alvaro Herrera
On 2019-Jul-10, Bruce Momjian wrote:

> Uh, what if a transaction modifies page 0 and page 1 of the same table
> --- don't those pages have the same LSN?

No, because WAL being a physical change log, each page gets its own
WAL record with its own LSN.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: let's make the list of reportable GUCs configurable (was Re: Add %r substitution for psql prompts to show recovery status)

2019-07-10 Thread Robert Haas
On Wed, Jul 10, 2019 at 9:59 AM Dave Cramer  wrote:
> I'm still a bit conflicted about what to do with search_path as I do believe 
> this is potentially a security issue.
> It may be that we always want to report that and possibly back patch it.

I don't see that as a feasible option unless we make the logic that
does the reporting smarter.  If it changes transiently inside of a
security-definer function, and then changes back, my recollection is
that right now we would report both changes.  I think that could cause
a serious efficiency problem if you are calling such a function in a
loop.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: let's make the list of reportable GUCs configurable (was Re: Add %r substitution for psql prompts to show recovery status)

2019-07-10 Thread Tom Lane
Robert Haas  writes:
> On Wed, Jul 10, 2019 at 9:59 AM Dave Cramer  wrote:
>> I'm still a bit conflicted about what to do with search_path as I do believe 
>> this is potentially a security issue.
>> It may be that we always want to report that and possibly back patch it.

> I don't see that as a feasible option unless we make the logic that
> does the reporting smarter.  If it changes transiently inside of a
> security-definer function, and then changes back, my recollection is
> that right now we would report both changes.  I think that could cause
> a serious efficiency problem if you are calling such a function in a
> loop.

And, even more to the point, what's the client side going to do with
the information?  If there was a security problem inside the
security-definer function, it's too late.  And the client can't do
much about it anyway.

If we have a configurable GUC_REPORT list, and somebody thinks it's useful
to them to report search_path, I don't wish to stand in their way.
But the argument that this is useful is so tissue-thin that we have no
business imposing the overhead on everybody, much less back-patching it.

regards, tom lane




[sqlsmith] Crash in mcv_get_match_bitmap

2019-07-10 Thread Andreas Seltenreich
Hi,

running sqlsmith on the regression database of REL_12_STABLE at
ff597b656f yielded a crash in mcv_get_match_bitmap.  I can reproduce it
with the following query on the regression database:

select filler1 from mcv_lists where a is not null and (select 42) <= c;

Backtrace below.

regards,
Andreas

Program received signal SIGSEGV, Segmentation fault.
pg_detoast_datum (datum=0x0) at fmgr.c:1741
(gdb) bt
#0  pg_detoast_datum (datum=0x0) at fmgr.c:1741
#1  0x55b2bbeb2656 in numeric_le (fcinfo=0x7ffceeb2cb90) at numeric.c:2139
#2  0x55b2bbf3cdca in FunctionCall2Coll (flinfo=flinfo@entry=0x7ffceeb2cc30,
    collation=collation@entry=100, arg1=<optimized out>,
    arg2=<optimized out>) at fmgr.c:1162
#3  0x55b2bbdd7aec in mcv_get_match_bitmap (root=0x55b2bd2acff0,
    clauses=<optimized out>, keys=0x55b2bd2c4e38,
    mcvlist=0x55b2bd2c44e0, is_or=false) at mcv.c:1638
#4  0x55b2bbdda581 in mcv_clauselist_selectivity
    (root=root@entry=0x55b2bd2acff0, stat=stat@entry=0x55b2bd2c4e00,
    clauses=clauses@entry=0x55b2bd2c5298, varRelid=varRelid@entry=0,
    jointype=jointype@entry=JOIN_INNER, sjinfo=sjinfo@entry=0x0,
    rel=0x55b2bd2c4158, basesel=0x7ffceeb2cd70,
    totalsel=0x7ffceeb2cd78) at mcv.c:1876
#5  0x55b2bbdd6064 in statext_mcv_clauselist_selectivity
    (estimatedclauses=0x7ffceeb2cde8, rel=0x55b2bd2c4158,
    sjinfo=<optimized out>, jointype=<optimized out>,
    varRelid=<optimized out>, clauses=0x55b2bd2c4e00,
    root=<optimized out>) at extended_stats.c:1146
#6  statext_clauselist_selectivity (root=root@entry=0x55b2bd2acff0,
    clauses=clauses@entry=0x55b2bd2c5010, varRelid=varRelid@entry=0,
    jointype=jointype@entry=JOIN_INNER, sjinfo=sjinfo@entry=0x0,
    rel=0x55b2bd2c4158, estimatedclauses=0x7ffceeb2cde8)
    at extended_stats.c:1177
#7  0x55b2bbd27372 in clauselist_selectivity
    (root=root@entry=0x55b2bd2acff0, clauses=0x55b2bd2c5010,
    varRelid=varRelid@entry=0, jointype=jointype@entry=JOIN_INNER,
    sjinfo=sjinfo@entry=0x0) at clausesel.c:94
#8  0x55b2bbd2d788 in set_baserel_size_estimates
    (root=root@entry=0x55b2bd2acff0, rel=rel@entry=0x55b2bd2c4158)
    at costsize.c:4411
#9  0x55b2bbd24658 in set_plain_rel_size (rte=0x55b2bd20cf00,
    rel=0x55b2bd2c4158, root=0x55b2bd2acff0) at allpaths.c:583
#10 set_rel_size (root=root@entry=0x55b2bd2acff0,
    rel=rel@entry=0x55b2bd2c4158, rti=rti@entry=1,
    rte=rte@entry=0x55b2bd20cf00) at allpaths.c:412
#11 0x55b2bbd264a0 in set_base_rel_sizes (root=<optimized out>)
    at allpaths.c:323
#12 make_one_rel (root=root@entry=0x55b2bd2acff0,
    joinlist=joinlist@entry=0x55b2bd2c49c0) at allpaths.c:185
#13 0x55b2bbd482f8 in query_planner (root=root@entry=0x55b2bd2acff0,
    qp_callback=qp_callback@entry=0x55b2bbd48ed0 <standard_qp_callback>,
    qp_extra=qp_extra@entry=0x7ffceeb2d070) at planmain.c:271
#14 0x55b2bbd4cb32 in grouping_planner (root=<optimized out>,
    inheritance_update=false, tuple_fraction=<optimized out>)
    at planner.c:2048
#15 0x55b2bbd4f900 in subquery_planner (glob=glob@entry=0x55b2bd2b1c88,
    parse=parse@entry=0x55b2bd20cd88, parent_root=parent_root@entry=0x0,
    hasRecursion=hasRecursion@entry=false,
    tuple_fraction=tuple_fraction@entry=0) at planner.c:1012
#16 0x55b2bbd509c6 in standard_planner (parse=0x55b2bd20cd88,
    cursorOptions=256, boundParams=<optimized out>) at planner.c:406
#17 0x55b2bbe13b89 in pg_plan_query
    (querytree=querytree@entry=0x55b2bd20cd88,
    cursorOptions=cursorOptions@entry=256,
    boundParams=boundParams@entry=0x0) at postgres.c:878
[...]




Re: buildfarm's typedefs list has gone completely nutso

2019-07-10 Thread Andrew Dunstan



On 7/10/19 1:34 PM, Andres Freund wrote:
>
> Hm, it has gotten gcc-9 installed recently, but calliphoridae isn't
> using that. So it's probably not the compiler side. But I also see a
> binutils upgrade:
>
> 2019-07-08 06:22:48 upgrade binutils-multiarch:amd64 2.31.1-16 
> 2.32.51.20190707-1
>
> and corresponding upgrades for all the arch-specific packages. I suspect
> it might be that.
>
> I can't immediately reproduce that locally though, using the same
> version of binutils. It's somewhat annoying that the buildfarm uses a
> different form of computing the typedefs than src/tools/find_typedef ...
>


That script is notably non-portable, and hasn't seen any significant
update in a decade.


If you want to run something like the buildfarm code, see

for some clues


I ran the client on a new Fedora 30 and it didn't produce the error.


cheers


andrew







Re: [sqlsmith] Crash in mcv_get_match_bitmap

2019-07-10 Thread Tom Lane
Andreas Seltenreich  writes:
> running sqlsmith on the regression database of REL_12_STABLE at
> ff597b656f yielded a crash in mcv_get_match_bitmap.  I can reproduce it
> with the following query on the regression database:
> select filler1 from mcv_lists where a is not null and (select 42) <= c;

Seems to be the same problem I just complained of in the other
thread: mcv_get_match_bitmap has an untenable assumption that
"is a pseudoconstant" means "is a Const".

I notice it's also assuming that the Const must be non-null.
It's not really easy to crash it that way, because if you just
write "null <= c" that'll get reduced to constant-NULL earlier.
But I bet there's a way.

regards, tom lane




Refactoring syslogger piping to simplify adding new log destinations

2019-07-10 Thread Sehrope Sarkuni
Hi,

While working on adding a new log_destination I noticed that the
syslogger piping would need to be updated. At the moment both ends
only handle stderr/csvlog as the pipe message header has a char
"is_last" that is either t/f (stderr last, stderr partial) or T/F
(csvlog last, csvlog partial). Couple approaches came to mind:

1. Use additional pairs of chars for each additional destination (e.g.
x/X, y/Y, ...) and mimic the logic of csvlog.
2. Repurpose the char "is_last" as a bitmap of the log destination
with the highest order bit indicating whether it's the last chunk.
3. Add a separate field "dest" for the log destination and leave
"is_last" as a t/f indicating whether it's the last chunk.

Attached are patches for each approach (fun exercise!). Also attached
is a basic TAP test to invoke the csvlog destination. It's a clone of
pg_ctl log rotation test that looks for .csv logs. If there's interest
in the test I was thinking of expanding it a bit to include "big"
output that would span multiple messages to test the partial/combining
path. My thoughts on the approaches:

#1 doesn't change the header types or size but seems ugly as it leads
to new pairs of constants and logic in multiple places. In particular,
both send and receive ends have to encode and decode the destination.
#2 is cleaner as there's a logical separation of the dest field and
no need for new constant pairs when adding new destinations. We would
also need to ensure new LOG_DESTINATION_xyz constants do not use that
last bit (there are already four now, so room for three more).
#3 leads to the cleanest code, though you lose 4 bytes of max data size
per chunk.

Which would be preferable? I'd like to validate the approach, as the
new log destination would be built atop it. I'm leaning toward #3, though
if the 4-byte loss is a deal breaker then #2.
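For concreteness, approach #3 would look roughly like this (a sketch, not
committed code: the first three fields mirror today's PipeProtoHeader in
syslogger.h, the new "dest" field is the proposal, and uint16/int32/bool
are the usual c.h types):

/* Approach #3: a separate destination field plus a plain boolean is_last. */
typedef struct
{
    char        nuls[2];        /* always \0\0 */
    uint16      len;            /* size of this chunk (counts data only) */
    int32       pid;            /* writer's pid */
    int32       dest;           /* log destination: a LOG_DESTINATION_* bit */
    char        is_last;        /* last chunk of message? 't' or 'f' */
} PipeProtoHeader;

/* The sender then needs no per-destination character pairs: */
static void
fill_header(PipeProtoHeader *hdr, int32 my_pid, uint16 data_len,
            int32 destination, bool last_chunk)
{
    hdr->nuls[0] = hdr->nuls[1] = '\0';
    hdr->len = data_len;
    hdr->pid = my_pid;
    hdr->dest = destination;    /* e.g. LOG_DESTINATION_CSVLOG */
    hdr->is_last = last_chunk ? 't' : 'f';
}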

Regards,
-- Sehrope Sarkuni
Founder & CEO | JackDB, Inc. | https://www.jackdb.com/
From 9ec38a587b0c2645bc9fd73398c1debdf9fa962b Mon Sep 17 00:00:00 2001
From: Sehrope Sarkuni 
Date: Wed, 10 Jul 2019 09:27:44 -0400
Subject: [PATCH 1/2] Add basic test for csvlog

---
 src/bin/pg_ctl/t/005_csvlog.pl | 94 ++
 1 file changed, 94 insertions(+)
 create mode 100644 src/bin/pg_ctl/t/005_csvlog.pl

diff --git a/src/bin/pg_ctl/t/005_csvlog.pl b/src/bin/pg_ctl/t/005_csvlog.pl
new file mode 100644
index 00..b74c373bb6
--- /dev/null
+++ b/src/bin/pg_ctl/t/005_csvlog.pl
@@ -0,0 +1,94 @@
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 4;
+use Time::HiRes qw(usleep);
+
+# Set up node with logging collector
+my $node = get_new_node('primary');
+$node->init();
+$node->append_conf(
+	'postgresql.conf', qq(
+logging_collector = on
+lc_messages = 'C'
+log_destination = 'csvlog'
+));
+
+$node->start();
+
+# Verify that log output gets to the file
+
+$node->psql('postgres', 'SELECT 1/0');
+
+my $current_logfiles = slurp_file($node->data_dir . '/current_logfiles');
+
+note "current_logfiles = $current_logfiles";
+
+like(
+	$current_logfiles,
+	qr|^csvlog log/postgresql-.*csv$|,
+	'current_logfiles is sane');
+
+my $lfname = $current_logfiles;
+$lfname =~ s/^csvlog //;
+chomp $lfname;
+
+# might need to retry if logging collector process is slow...
+my $max_attempts = 180 * 10;
+
+my $first_logfile;
+for (my $attempts = 0; $attempts < $max_attempts; $attempts++)
+{
+	$first_logfile = slurp_file($node->data_dir . '/' . $lfname);
+	last if $first_logfile =~ m/division by zero/;
+	usleep(100_000);
+}
+
+like($first_logfile, qr/division by zero/, 'found expected log file content');
+
+# Sleep 2 seconds and ask for log rotation; this should result in
+# output into a different log file name.
+sleep(2);
+$node->logrotate();
+
+# pg_ctl logrotate doesn't wait for rotation request to be completed.
+# Allow a bit of time for it to happen.
+my $new_current_logfiles;
+for (my $attempts = 0; $attempts < $max_attempts; $attempts++)
+{
+	$new_current_logfiles = slurp_file($node->data_dir . '/current_logfiles');
+	last if $new_current_logfiles ne $current_logfiles;
+	usleep(100_000);
+}
+
+note "now current_logfiles = $new_current_logfiles";
+
+like(
+	$new_current_logfiles,
+	qr|^csvlog log/postgresql-.*csv$|,
+	'new current_logfiles is sane');
+
+$lfname = $new_current_logfiles;
+$lfname =~ s/^csvlog //;
+chomp $lfname;
+
+# Verify that log output gets to this file, too
+
+$node->psql('postgres', 'fee fi fo fum');
+
+my $second_logfile;
+for (my $attempts = 0; $attempts < $max_attempts; $attempts++)
+{
+	$second_logfile = slurp_file($node->data_dir . '/' . $lfname);
+	last if $second_logfile =~ m/syntax error/;
+	usleep(100_000);
+}
+
+like(
+	$second_logfile,
+	qr/syntax error/,
+	'found expected log file content in new log file');
+
+$node->stop();
-- 
2.17.1

From 7b9d827e5059445945d214388371138f0676b306 Mon Sep 17 00:00:00 2001
From: Sehrope Sarkuni 
Date: Wed, 10 Jul 2019 07:17:58 -0400
Subject: [PATCH 2/2] Refactor PipeProtoHeader.is_last constants

Adds c

Re: [sqlsmith] Crash in mcv_get_match_bitmap

2019-07-10 Thread Tomas Vondra

On Wed, Jul 10, 2019 at 04:57:54PM -0400, Tom Lane wrote:

Andreas Seltenreich  writes:

running sqlsmith on the regression database of REL_12_STABLE at
ff597b656f yielded a crash in mcv_get_match_bitmap.  I can reproduce it
with the following query on the regression database:
select filler1 from mcv_lists where a is not null and (select 42) <= c;


Seems to be the same problem I just complained of in the other
thread: mcv_get_match_bitmap has an untenable assumption that
"is a pseudoconstant" means "is a Const".



Yeah, that's a bug. Will fix (not sure how yet).

BTW which other thread? I don't see any other threads mentioning this
function.


I notice it's also assuming that the Const must be non-null.
It's not really easy to crash it that way, because if you just
write "null <= c" that'll get reduced to constant-NULL earlier.
But I bet there's a way.



Hmmm, I did not think of that.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: [sqlsmith] Crash in mcv_get_match_bitmap

2019-07-10 Thread Tom Lane
Tomas Vondra  writes:
> BTW which other thread? I don't see any other threads mentioning this
> function.

https://www.postgresql.org/message-id/flat/CA%2Bu7OA65%2BjEFb_TyV5g%2BKq%2BonyJ2skMOPzgTgFH%2BqgLwszRqvw%40mail.gmail.com

regards, tom lane




Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-10 Thread Joe Conway
On 7/10/19 3:53 PM, Alvaro Herrera wrote:
> On 2019-Jul-10, Bruce Momjian wrote:
> 
>> Good, so I think we all now agree we have to put the nonce
>> (pg_class.oid, LSN, page-number) through the cipher using the secret.

(been traveling -- just trying to get caught up on this thread)

> Actually, why do you need the page number in the nonce?  The LSN already
> distinguishes pages -- you can't have two pages with the same LSN, can
> you?  (I do think you can have multiple writes of the same page with
> different LSNs, if you change hint bits and don't write WAL about it,

Do you mean "multiple writes of the same page without..."?

> but maybe we should force CRC enabled in encrypted tables, which I think
> closes this hole?)

If we can use the LSN (perhaps with CRC) without the page number that
would seem to be a good idea.
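
For concreteness, the (pg_class.oid, LSN, page-number) packing Bruce
described might be sketched like this -- field order, padding, and the
16-byte IV size are all illustrative assumptions, not a settled design:

    #include "postgres.h"           /* Oid, uint8 */
    #include "access/xlogdefs.h"    /* XLogRecPtr */
    #include "storage/block.h"      /* BlockNumber */
    #include <string.h>

    /* Sketch: pack the nonce components into a 16-byte block-cipher IV. */
    static void
    make_page_iv(uint8 iv[16], Oid relid, XLogRecPtr lsn, BlockNumber blkno)
    {
        memcpy(iv, &relid, sizeof(relid));       /* bytes 0-3:   pg_class.oid */
        memcpy(iv + 4, &lsn, sizeof(lsn));       /* bytes 4-11:  page LSN */
        memcpy(iv + 12, &blkno, sizeof(blkno));  /* bytes 12-15: block number */
    }

If the LSN alone distinguishes pages, the oid and block-number bytes
could be dropped or repurposed (say, as a CTR block counter).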

Joe

-- 
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development




Re: [sqlsmith] Crash in mcv_get_match_bitmap

2019-07-10 Thread Tom Lane
Tomas Vondra  writes:
> Yeah, that's a bug. Will fix (not sure how yet).

You could do worse than replace this:

    ok = (NumRelids(clause) == 1) &&
        (is_pseudo_constant_clause(lsecond(expr->args)) ||
         (varonleft = false,
          is_pseudo_constant_clause(linitial(expr->args))));

with something like

    if (IsA(linitial(expr->args), Var) &&
        IsA(lsecond(expr->args), Const))
        ok = true, varonleft = true;
    else if (IsA(linitial(expr->args), Const) &&
             IsA(lsecond(expr->args), Var))
        ok = true, varonleft = false;

Or possibly get rid of varonleft as such, and merge extraction of the
"var" and "cst" variables into this test.

BTW, I bet passing a unary-argument OpExpr also makes this code
unhappy.

regards, tom lane




Re: [sqlsmith] Crash in mcv_get_match_bitmap

2019-07-10 Thread Tomas Vondra

On Wed, Jul 10, 2019 at 05:45:24PM -0400, Tom Lane wrote:

Tomas Vondra  writes:

Yeah, that's a bug. Will fix (not sure how yet).


You could do worse than replace this:

    ok = (NumRelids(clause) == 1) &&
        (is_pseudo_constant_clause(lsecond(expr->args)) ||
         (varonleft = false,
          is_pseudo_constant_clause(linitial(expr->args))));

with something like

    if (IsA(linitial(expr->args), Var) &&
        IsA(lsecond(expr->args), Const))
        ok = true, varonleft = true;
    else if (IsA(linitial(expr->args), Const) &&
             IsA(lsecond(expr->args), Var))
        ok = true, varonleft = false;

Or possibly get rid of varonleft as such, and merge extraction of the
"var" and "cst" variables into this test.



OK, thanks for the suggestion.

I probably also need to look at the "is compatible" test in
extended_stats.c, which also examines the clauses. It may not crash, as
it does not attempt to extract the const values etc., but it likely
needs to stay in sync with this part.
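
Presumably that check ends up with the same symmetric shape, something
like (sketch only):

    /* accept only (Var op Const) or (Const op Var) */
    if (!((IsA(linitial(expr->args), Var) &&
           IsA(lsecond(expr->args), Const)) ||
          (IsA(linitial(expr->args), Const) &&
           IsA(lsecond(expr->args), Var))))
        return false;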


BTW, I bet passing a unary-argument OpExpr also makes this code
unhappy.


Whooops :-(


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-10 Thread Tomas Vondra

On Wed, Jul 10, 2019 at 04:11:21PM -0400, Alvaro Herrera wrote:

On 2019-Jul-10, Bruce Momjian wrote:


Uh, what if a transaction modifies page 0 and page 1 of the same table
--- don't those pages have the same LSN?


No, because WAL being a physical change log, each page gets its own
WAL record with its own LSN.



What if you have wal_log_hints=off? AFAIK that won't change the page LSN.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-10 Thread Stephen Frost
Greetings,

* Tomas Vondra (tomas.von...@2ndquadrant.com) wrote:
> On Wed, Jul 10, 2019 at 04:11:21PM -0400, Alvaro Herrera wrote:
> >On 2019-Jul-10, Bruce Momjian wrote:
> >
> >>Uh, what if a transaction modifies page 0 and page 1 of the same table
> >>--- don't those pages have the same LSN?
> >
> >No, because WAL being a physical change log, each page gets its own
> >WAL record with its own LSN.
> >
> 
> What if you have wal_log_hints=off? AFAIK that won't change the page LSN.

Alvaro suggested elsewhere that we require checksums for these, which
would also force wal_log_hints to be on, and therefore the LSN would
change.
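
(FWIW, that coupling already exists in the backend: MarkBufferDirtyHint()
considers WAL-logging a hint-bit change -- which is what advances the
page LSN -- only when, paraphrasing bufmgr.c:

    if (XLogHintBitIsNeeded())  /* wal_log_hints || data checksums */
        lsn = XLogSaveBufferForHint(buffer, buffer_std);

so forcing checksums on gets us that behavior for free.)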

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-10 Thread Tomas Vondra

On Wed, Jul 10, 2019 at 06:04:30PM -0400, Stephen Frost wrote:

Greetings,

* Tomas Vondra (tomas.von...@2ndquadrant.com) wrote:

On Wed, Jul 10, 2019 at 04:11:21PM -0400, Alvaro Herrera wrote:
>On 2019-Jul-10, Bruce Momjian wrote:
>
>>Uh, what if a transaction modifies page 0 and page 1 of the same table
>>--- don't those pages have the same LSN?
>
>No, because WAL being a physical change log, each page gets its own
>WAL record with its own LSN.
>

What if you have wal_log_hints=off? AFAIK that won't change the page LSN.


Alvaro suggested elsewhere that we require checksums for these, which
would also force wal_log_hints to be on, and therefore the LSN would
change.



Oh, I see - yes, that would solve the hint bits issue. Not sure we want
to combine the features like this, though, as it increases the costs of
TDE. But maybe it's the best solution.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 




