Re: Pre-proposal: unicode normalized text

2023-11-02 Thread Nico Williams
On Fri, Oct 06, 2023 at 02:37:06PM -0400, Robert Haas wrote:
> > Sure, because TEXT in PG doesn't have codeset+encoding as part of it --
> > it's whatever the database's encoding is.  Collation can and should be a
> > property of a column, since for Unicode it wouldn't be reasonable to
> > make that part of the type.  But codeset+encoding should really be a
> > property of the type if PG were to support more than one.  IMO.
> 
> No, what I mean is, you can't just be like "oh, the varlena will be
> different in memory than on disk" as if that were no big deal.

It would have to be the same in memory as on disk, indeed, but you might
need new types in C as well for that.

> I agree that, as an alternative to encoding being a column property,
> it could instead be completely a type property, meaning that if you
> want to store, say, LATIN1 text in your UTF-8 database, you first
> create a latin1text data type and then use it, rather than, as in the
> model I proposed, creating a text column and then applying a setting
> like ENCODING latin1 to it. I think that there might be some problems

Yes, that was the idea.

> with that model, but it could also have some benefits. [...]

Mainly, I think, the question is whether you want PG to do automatic
codeset conversions (ugly and problematic) or not, e.g. when using text
functions.

Automatic codeset conversions are problematic because a) they can be
lossy (so what to do when they are?) and b) automatic type conversions
can be surprising.

Ultimately the client would have to do its own codeset conversions, if
it wants them, or treat text in codesets other than its local one as
blobs and leave it for a higher app layer to deal with.

I wouldn't want to propose automatic codeset conversions.  If you'd want
that then you might as well declare it has to all be UTF-8 and say no to
any other codesets.

> But, even if we were all convinced that this kind of feature was good
> to add, I think it would almost certainly be wrong to invent new
> varlena features along the way.

Yes.

Nico
-- 




Re: Pre-proposal: unicode normalized text

2023-11-02 Thread Nico Williams
On Wed, Oct 04, 2023 at 01:16:22PM -0400, Robert Haas wrote:
> There's a very popular commercial database where, or so I have been
> led to believe, any byte sequence at all is accepted when you try to
> put values into the database. [...]

In other circles we call this "just-use-8".

ZFS, for example, has an option to require that filenames be valid
UTF-8; when that option is off, ZFS will accept any garbage (other than
ASCII NUL and /, for obvious reasons).

For filesystems the situation is a bit dire because:

 - strings at the system call boundary have never been tagged with a
   codeset (in the beginning there was only ASCII)
 - there has never been a standard codeset to use at the system call
   boundary
 - there have been multiple codesets in use for decades

so filesystems have to be prepared to be tolerant of garbage, at least
until only Unicode is left (UTF-16 on Windows filesystems, UTF-8 for
most others).

This is another reason that ZFS has form-insensitive/form-preserving
behavior: if you want to use non-UTF-8 filenames then names or
substrings thereof that look like valid UTF-8 won't accidentally be
broken by normalization.

If PG never tagged strings with codesets on the wire then PG has the
same problem, especially since there are multiple implementations of the
PG wire protocol.

So I can see why a "popular database" might want to take this approach.

For the longer run, though, either move to supporting only UTF-8, or
allow multiple text types, each with a codeset specified in its type.

> At any rate, if we were to go in the direction of rejecting code
> points that aren't yet assigned, or aren't yet known to the collation
> library, that's another way for data loading to fail. Which feels like
> very defensible behavior, but not what everyone wants, or is used to.

Yes.  See points about ZFS.  I do think ZFS struck a good balance.

PG could take the ZFS approach and add functions for use in CHECK
constraints that enforce valid UTF-8, valid Unicode (no use of
unassigned codepoints, no use of private use codepoints not configured
into the database), etc.
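
Concretely, that might look something like this (a sketch only --
is_valid_unicode() and unicode_assigned() are hypothetical names for
such enforcement functions, not existing PG functions):

  -- The column stays plain text; the constraints opt in to stricter
  -- validation, in the ZFS spirit.
  CREATE TABLE docs (
      body text,
      CONSTRAINT body_is_unicode  CHECK (is_valid_unicode(body)),
      CONSTRAINT body_is_assigned CHECK (unicode_assigned(body))
  );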

Coming back to the "just-use-8" thing, a database could have a text type
where the codeset is not specified, one or more text types where the
codeset is specified, manual or automatic codeset conversions, and
whatever enforcement functions make sense -- provided that the type
information is not lost at the edges.

> > Whether we ever get to a core data type -- and more importantly,
> > whether anyone uses it -- I'm not sure.
> 
> Same here.

A TEXTutf8 type (whatever name you want to give it) could be useful as a
way to a) opt into heavier enforcement without having to write CHECK
constraints, and b) document intent -- all provided that the type is not
lost on the wire nor in memory.

Support for other codesets is less important.

Nico
-- 




Re: Pre-proposal: unicode normalized text

2023-11-02 Thread Nico Williams
On Tue, Oct 17, 2023 at 05:07:40PM +0200, Daniel Verite wrote:
> > * Add a per-database option to enforce only storing assigned unicode
> > code points.
> 
> There's a problem in the fact that the set of assigned code points is
> expanding with every Unicode release, which happens about every year.
> 
> If we had this option in Postgres 11 released in 2018 it would use
> Unicode 11, and in 2023 this feature would reject thousands of code
> points that have been assigned since then.

Yes, and that's desirable if PG were to normalize text as Jeff proposes,
since then PG wouldn't know how to normalize text containing codepoints
assigned after that.  At that point to use those codepoints you'd have
to upgrade PG -- not too unreasonable.

Nico
-- 




Re: Pre-proposal: unicode normalized text

2023-11-02 Thread Nico Williams
On Wed, Oct 04, 2023 at 01:15:03PM -0700, Jeff Davis wrote:
> > The fact that there are multiple types of normalization and multiple
> > notions of equality doesn't make this easier.

And then there's text that isn't normalized to any of them.

> NFC is really the only one that makes sense.

Yes.

Most input modes produce NFC, though there may be scripts (like Hangul)
where input modes might produce NFD, so I wouldn't say NFC is universal.

Unfortunately, HFS+ uses NFD, so NFD can leak into places naturally
enough through OS X.

> I believe that having a kind of text data type where it's stored in NFC
> and compared with memcmp() would be a good place for many users to be --
> probably most users. It's got all the performance and stability
> benefits of memcmp(), with slightly richer semantics. It's less likely
> that someone malicious can confuse the database by using different
> representations of the same character.
> 
> The problem is that it's not universally better for everyone: there are
> certainly users who would prefer that the codepoints they send to the
> database are preserved exactly, and also users who would like to be
> able to use unassigned code points.

The alternative is form-insensitivity, where you compare strings as
equal even if they aren't memcmp()-equal, as long as they are equal when
normalized.  This can be made fast, though not as fast as memcmp().

The problem with form insensitivity is that you might have to implement
it in numerous places.  In ZFS there's only a few, but in a database
every index type, for example, will need to hook in form insensitivity.
If so then that complexity would be a good argument to just normalize.
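
You can approximate form-preserving/form-insensitive behavior today
with an expression index over normalize() (in PG since 13), at the cost
of having to normalize in every lookup -- a sketch:

  -- The table stores whatever form the client sent; only the index
  -- sees the NFC form, so names that are equal-when-normalized collide.
  CREATE TABLE users (username text);
  CREATE UNIQUE INDEX users_username_nfc
      ON users ((normalize(username, NFC)));

  -- Lookups must normalize too, else the index can't be used:
  SELECT * FROM users
   WHERE normalize(username, NFC) = normalize($1, NFC);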

Nico
-- 




Re: WIP Incremental JSON Parser

2024-01-03 Thread Nico Williams
On Tue, Jan 02, 2024 at 10:14:16AM -0500, Robert Haas wrote:
> It seems like a pretty significant savings no matter what. Suppose the
> backup_manifest file is 2GB, and instead of creating a 2GB buffer, you
> create an 1MB buffer and feed the data to the parser in 1MB chunks.
> Well, that saves 2GB less 1MB, full stop. Now if we address the issue
> you raise here in some way, we can potentially save even more memory,
> which is great, but even if we don't, we still saved a bunch of memory
> that could not have been saved in any other way.

You could also build a streaming incremental parser.  That is, one that
outputs a path and a leaf value (where leaf values are scalar values:
`null`, `true`, `false`, numbers, and strings).  If the caller is doing
something JSONPath-like, it can then immediately free almost all
allocations and even terminate the parse early.

Nico
-- 




Re: SET ROLE x NO RESET

2024-01-03 Thread Nico Williams
On Sat, Dec 30, 2023 at 10:16:59AM -0600, Eric Hanson wrote:
> What do you think of adding a NO RESET option to the SET ROLE command?

I've wanted this forever.  Consider using this to implement user
authentication mechanisms in user-defined SQL functions that use `SET
ROLE` with `NO RESET` to "log in" the user.  One could implement JWT (or
whatever bearer token scheme) on the server side in PlPgSQL w/ pgcrypto
this way, with zero changes to PG itself, no protocol changes, etc.

For bearer token schemes one could acquire the token externally to the
client and then just `SELECT login(?)`, bind the token, and execute to
log in.
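
A sketch of what that could look like (assuming the proposed `NO RESET`
syntax, which does not exist today, and a hypothetical tokens table;
digest() is pgcrypto):

  -- SECURITY DEFINER so an unprivileged session can look up the token.
  -- SET ROLE ... NO RESET is the proposed feature, not current syntax.
  CREATE FUNCTION login(token text) RETURNS void
  LANGUAGE plpgsql SECURITY DEFINER AS $$
  DECLARE
      r text;
  BEGIN
      SELECT role_name INTO STRICT r   -- STRICT: no match -> error
        FROM tokens
       WHERE token_hash = digest(login.token, 'sha256')
         AND expires_at > now();
      EXECUTE format('SET ROLE %I NO RESET', r);
  END $$;

  -- Client side:  SELECT login($1);  -- bind the bearer token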

Nico
-- 




Re: SET ROLE x NO RESET

2024-01-03 Thread Nico Williams
On Tue, Jan 02, 2024 at 12:36:38PM -0500, Robert Haas wrote:
> IMHO, the best solution here would be a protocol message to change the
> session user. The pooler could use that repeatedly on the same
> session, but refuse to propagate such messages from client
> connections.

But this requires upgrading clients too.

IMO `SET ROLE .. NO RESET` would be terribly useful.  One could build:

 - login systems (e.g., bearer tokens, passwords) in SQL / PlPgSQL / etc

 - sudo-like things

Though maybe `NO RESET` isn't really needed to build these, since after
all one could use an unprivileged role and a SECURITY DEFINER function
that does the `SET ROLE` following some user-defined authentication
method.  And so what if the client can RESET the role?  That just brings
it back to the otherwise unprivileged role.

Who needs to RESET roles anyways?  Answer: connection pools, but not
every connection is used via a pool.  This brings up something: attempts
to reset a NO RESET session need to fail in a way that a connection pool
can detect and disconnect on, or else the attempt needs to fail by
terminating the connection altogether.

Nico
-- 




Re: Pre-proposal: unicode normalized text

2023-10-02 Thread Nico Williams
On Tue, Sep 12, 2023 at 03:47:10PM -0700, Jeff Davis wrote:
> One of the frustrations with using the "C" locale (or any deterministic
> locale) is that the following returns false:
> 
>   SELECT 'á' = 'á'; -- false
> 
> because those are the unicode sequences U&'\0061\0301' and U&'\00E1',
> respectively, so memcmp() returns non-zero. But it's really the same
> character with just a different representation, and if you normalize
> them they are equal:
> 
>   SELECT normalize('á') = normalize('á'); -- true

I think you misunderstand Unicode normalization and equivalence.  There
is no standard Unicode `normalize()` that would cause the above equality
predicate to be true.  If you normalize to NFD (normal form decomposed)
then a _prefix_ of those two strings will be equal, but that's clearly
not what you're looking for.

PostgreSQL already has Unicode normalization support, though it would be
nice to also have form-insensitive indexing and equality predicates.

There are two ways to write 'á' in Unicode: one is pre-composed (one
codepoint) and the other is decomposed (two codepoints in this specific
case), and it would be nice to be able to preserve input form when
storing strings but then still be able to index and match them
form-insensitively (in the case of 'á' both equivalent representations
should be considered equal, and for UNIQUE indexes they should be
considered the same).

You could also have functions that perform lossy normalization in the
sort of way that soundex does, such as first normalizing to NFD then
dropping all combining codepoints which then could allow 'á' to be eq to
'a'.  But this would not be a Unicode normalization function.
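
For instance (a sketch using existing PG functions; the U+0300..U+036F
range only covers the common Latin combining marks):

  -- Decompose to NFD, then strip combining marks: lossy folding in
  -- the soundex spirit, NOT Unicode normalization.
  SELECT regexp_replace(normalize('á', NFD), '[\u0300-\u036F]', '', 'g');
  -- => 'a'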

Nico
-- 




Re: Pre-proposal: unicode normalized text

2023-10-03 Thread Nico Williams
On Tue, Oct 03, 2023 at 12:15:10PM -0700, Jeff Davis wrote:
> On Mon, 2023-10-02 at 15:27 -0500, Nico Williams wrote:
> > I think you misunderstand Unicode normalization and equivalence. 
> > There is no standard Unicode `normalize()` that would cause the
> > above equality predicate to be true.  If you normalize to NFD
> > (normal form decomposed) then a _prefix_ of those two strings will
> > be equal, but that's clearly not what you're looking for.

Ugh, my client is not displaying 'a' correctly, thus I misunderstood
your post.

> From [1]:

Here's what you wrote in your post:

| [...] But it's really the same
| character with just a different representation, and if you normalize
| them they are equal:
|
|  SELECT normalize('á') = normalize('á'); -- true

but my client is not displaying 'a' correctly!  (It displays like 'a'
but it should display like 'á'.)

Bah.  So I'd (mis)interpreted you as saying that normalize('a') should
equal normalize('á').  Please disregard that part of my reply.

> > There are two ways to write 'á' in Unicode: one is pre-composed (one
> > codepoint) and the other is decomposed (two codepoints in this
> > specific case), and it would be nice to be able to preserve input
> > form when storing strings but then still be able to index and match
> > them form-insensitively (in the case of 'á' both equivalent
> > representations should be considered equal, and for UNIQUE indexes
> > they should be considered the same).
> 
> Sometimes preserving input differences is a good thing, other times
> it's not, depending on the context. Almost any data type has some
> aspects of the input that might not be preserved -- leading zeros in a
> number, or whitespace in jsonb, etc.

Almost every Latin input mode out there produces precomposed characters
and so they effectively produce NFC.  I'm not sure if the same is true
for, e.g., Hangul (Korean) and various other scripts.

But there are things out there that produce NFD.  Famously, Apple's HFS+
uses NFD (or something very close to NFD).  So if you cut-n-paste things
that got normalized to NFD into contexts where normalization isn't done,
then you might start wanting to alter those contexts to either normalize
or be form-preserving/form-insensitive.  Sometimes you don't get to
normalize, so you have to pick form-preserving/form-insensitive
behavior.

> If text is stored as normalized with NFC, it could be frustrating if
> the retrieved string has a different binary representation than the
> source data. But it could also be frustrating to look at two strings
> made up of ordinary characters that look identical and for the database
> to consider them unequal.

Exactly.  If you have such a case you might like the option to make your
database form-preserving and form-insensitive.  That means that indices
need to normalize strings, but tables need to store unnormalized
strings.

ZFS (filesystems are a bit like databases) does just that!

Nico
-- 




Re: Pre-proposal: unicode normalized text

2023-10-03 Thread Nico Williams
On Tue, Oct 03, 2023 at 03:34:44PM -0700, Jeff Davis wrote:
> On Tue, 2023-10-03 at 15:15 -0500, Nico Williams wrote:
> > Ugh, my client is not displaying 'a' correctly
> 
> Ugh. Is that an argument in favor of normalization or against?

Heheh, well, it's an argument in favor of more software getting this
right (darn it).

It's also an argument for building a time machine so HFS+ can just
always have used NFC.  But the existence of UTF-16 is proof that time
machines don't exist (or that only bad actors have them).

> I've also noticed that some fonts render the same character a bit
> differently depending on the constituent code points. For instance, if
> the accent is its own code point, it seems to be more prominent than if
> a single code point represents both the base character and the accent.
> That seems to be a violation, but I can understand why that might be
> useful.

Yes, that happens.  Did you know that the ASCII character set was
designed with overstrike in mind for typing of accented Latin
characters?  Unicode combining sequences are kinda like that, but more
complex.

Yes, the idea really was that you could write a' (or 'a) to get á.
That's how people did it with typewriters anyways.

> > Almost every Latin input mode out there produces precomposed
> > characters and so they effectively produce NFC.
> 
> The problem is not the normal case, the problem will be things like
> obscure input methods, some kind of software that's being too clever,
> or some kind of malicious user trying to confuse the database.

_HFS+ enters the chat_

> > That means that indices
> > need to normalize strings, but tables need to store unnormalized
> > strings.
> 
> That's an interesting idea. Would the equality operator normalize
> first, or are you saying that the index would need to recheck the
> results?

You can optimize this to avoid having to normalize first.  Most strings
are not equal, and they tend to differ early.  And most strings will
likely be ASCII-mostly or in the same form anyways.  So you can just
walk a cursor down each string looking at two bytes, and if they are
both ASCII then you move each cursor forward by one byte, and if they
are not both ASCII then you take a slow path where you normalize one
grapheme cluster at each cursor (if necessary) and compare that.  (ZFS
does this.)

You can also assume ASCII-mostly, load as many bits of each string
(padding as needed) as will fit in SIMD registers, compare and check
that they're all ASCII, and if not then jump to the slow path.

You can also normalize one grapheme cluster at a time when hashing
(e.g., for hash indices), thus avoiding a large allocation if the string
is large.

Nico
-- 




Re: Pre-proposal: unicode normalized text

2023-10-04 Thread Nico Williams
On Tue, Sep 12, 2023 at 03:47:10PM -0700, Jeff Davis wrote:
> The idea is to have a new data type, say "UTEXT", that normalizes the
> input so that it can have an improved notion of equality while still
> using memcmp().

A UTEXT type would be helpful for specifying that the text must be
Unicode (in which transform?) even if the character data encoding for
the database is not UTF-8.

Maybe UTF8 might be a better name for the new type, since it would
denote the transform (and would allow for UTF16 and UTF32 some day,
though it's doubtful those would ever happen).

But it's one thing to specify Unicode (and transform) in the type and
another to specify an NF to normalize to on insert or on lookup.

How about new column constraint keywords, such as NORMALIZE (meaning
normalize on insert) and NORMALIZED (meaning reject non-canonical form
text), with an optional parenthetical by which to specify a non-default
form?  (These would apply to TEXT as well when the default encoding for
the DB is UTF-8.)

One could then ALTER TABLE to add this to existing tables.
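
Strawman syntax, to make that concrete (none of this is implemented):

  CREATE TABLE t (
      a TEXT NORMALIZE,          -- normalize to NFC on insert
      b TEXT NORMALIZED,         -- reject input not already in NFC
      c TEXT NORMALIZED (NFD)    -- reject input not in NFD
  );
  ALTER TABLE old_table ALTER COLUMN s SET NORMALIZE;  -- retrofit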

This would also make it easier to add a form-preserving/form-insensitive
mode later if it turns out to be useful or necessary, maybe making it
the default for Unicode text in new tables.

> Questions:
> 
>  * Would this be useful enough to justify a new data type? Would it be
> confusing about when to choose one versus the other?

Yes.  See above.  I think I'd rather have it be called UTF8, and the
normalization properties of it to be specified as column constraints.

>  * Would cross-type comparisons between TEXT and UTEXT become a major
> problem that would reduce the utility?

Maybe when the database's encoding is UTF-8, UTEXT (or UTF8) can be an
alias of TEXT.

>  * Should "some_utext_value = some_text_value" coerce the LHS to TEXT
> or the RHS to UTEXT?

Ooh, this is nice!  If the TEXT is _not_ UTF-8 then it could be
converted to UTF-8.  So I think which is RHS and which is LHS doesn't
matter -- it's which is UTF-8, and if both are then the only thing left
to do is normalize, and for that I'd take the LHS' form if the LHS is
UTF-8, else the RHS'.

Nico
-- 




Re: Pre-proposal: unicode normalized text

2023-10-04 Thread Nico Williams
On Wed, Oct 04, 2023 at 01:38:15PM -0700, Jeff Davis wrote:
> On Wed, 2023-10-04 at 14:02 -0400, Chapman Flack wrote:
> > The SQL standard would have me able to:
> > 
> > [...]
> > _UTF8'Hello, world!' and _LATIN1'Hello, world!'
> 
> Is there a use case for that? UTF-8 is able to encode any unicode code
> point, it's relatively compact, and it's backwards-compatible with 7-
> bit ASCII. If you have a variety of text data in your system (and in
> many cases even if not), then UTF-8 seems like the right solution.
> 
> Text data encoded 17 different ways requires a lot of bookkeeping in
> the type system, and it also requires injecting a bunch of fallible
> transcoding operators around just to compare strings.

Better that than TEXT blobs w/ the encoding given by the `CREATE
DATABASE` or `initdb` default!

It'd be a lot _less_ fragile to have all text tagged with an encoding
(indirectly, via its type which then denotes the encoding).

That would be a lot of work, but starting with just a UTF-8 text type
would be an improvement.

Nico
-- 




Re: Pre-proposal: unicode normalized text

2023-10-04 Thread Nico Williams
On Wed, Oct 04, 2023 at 05:32:50PM -0400, Chapman Flack wrote:
> Well, for what reason does anybody run PG now with the encoding set
> to anything besides UTF-8? I don't really have my finger on that pulse.

Because they still have databases that didn't use UTF-8 10 or 20 years
ago that they haven't migrated to UTF-8?

It's harder to think of why one might _want_ to store text in any
encoding other than UTF-8 for _new_ databases.

Though too there's no reason that it should be impossible other than
lack of developer interest: as long as text is tagged with its encoding,
it should be possible to store text in any number of encodings.

> Could it be that it bloats common strings in their local script, and
> with enough of those to store, it could matter to use the local
> encoding that stores them more economically?

Avoiding UTF-8 bloat is likely not worth the trouble.  UTF-8 is only
clearly bloaty when compared to encodings with 1-byte code units, like
ISO-8859-*.  For CJK, UTF-8 is not much more bloaty than native
non-Unicode encodings like SHIFT_JIS.

UTF-8 is not much bloatier than UTF-16 in general either.

Bloat is not really a good reason to avoid Unicode or any specific
transformation format (TF).

> Also, while any Unicode transfer format can encode any Unicode code
> point, I'm unsure whether it's yet the case that {any Unicode code
> point} is a superset of every character repertoire associated with
> every non-Unicode encoding.

It's not always been the case that Unicode is a strict superset of all
currently-in-use human scripts.  Making Unicode a strict superset of all
currently-in-use human scripts seems to be the Unicode Consortium's aim.

I think you're asking why not just use UTF-8 for everything, all the
time.  It's a fair question.  I don't have a reason to answer in the
negative (maybe someone else does).  But that doesn't mean that one
couldn't want to store text in many encodings (e.g., for historical
reasons).

Nico
-- 




Re: Pre-proposal: unicode normalized text

2023-10-04 Thread Nico Williams
On Wed, Oct 04, 2023 at 04:01:26PM -0700, Jeff Davis wrote:
> On Wed, 2023-10-04 at 16:15 -0500, Nico Williams wrote:
> > Better that than TEXT blobs w/ the encoding given by the `CREATE
> > DATABASE` or `initdb` default!
> 
> From an engineering perspective, yes, per-column encodings would be
> more flexible. But I still don't understand who exactly would use that,
> and why.

Say you have a bunch of text files in different encodings for reasons
(historical).  And now say you want to store them in a database so you
can index them and search them.  Sure, you could use a filesystem, but
you want an RDBMS.  Well, the answer to this is "convert all those files
to UTF-8".

> It would take an awful lot of effort to implement and make the code
> more complex, so we'd really need to see some serious demand for that.

Yes, it's better to just use UTF-8.

The DB could implement conversions to/from other codesets and encodings
for clients that insist on it.  Why would clients insist anyways?
Better to do the conversions at the clients.
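
(Explicit conversions already exist server-side, for what it's worth:
convert_to() and convert_from() take an explicit encoding name and go
between text and bytea:

  SELECT convert_to('Grüße', 'LATIN1');                  -- text -> bytea
  SELECT convert_from('\x4772fcdf65'::bytea, 'LATIN1');  -- bytea -> text

but that's manual, at the edges, which is the point.)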

In the middle it's best to just have Unicode, and specifically UTF-8,
then push all conversions to the edges of the system.

Nico
-- 




Re: Pre-proposal: unicode normalized text

2023-10-05 Thread Nico Williams
On Thu, Oct 05, 2023 at 07:31:54AM -0400, Robert Haas wrote:
> [...] On the other hand, to do that in PostgreSQL, we'd need to
> propagate the character set/encoding information into all of the
> places that currently get the typmod and collation, and that is not a
> small number of places. It's a lot of infrastructure for the project
> to carry around for a feature that's probably only going to continue
> to become less relevant.

Text+encoding can be just like bytea with a one- or two-byte prefix
indicating what codeset+encoding it's in.  That'd be how to encode
such text values on the wire, though on disk the column's type should
indicate the codeset+encoding, so no need to add a prefix to the value.

Complexity would creep in around when and whether to perform automatic
conversions.  The easy answer would be "never, on the server side", but
on the client side it might be useful to convert to/from the locale's
codeset+encoding when displaying to the user or accepting user input.

If there's no automatic server-side codeset/encoding conversions then
the server-side cost of supporting non-UTF-8 text should not be too high
dev-wise -- it's just (famous last words) a generic text type
parameterized by codeset+encoding type.  There would not even be a hard
need for functions for conversions, though there would be demand for
them.

But I agree that if there's no need, there's no need.  UTF-8 is great,
and if only all PG users would just switch then there's not much more to
do.

Nico
-- 




Re: Pre-proposal: unicode normalized text

2023-10-05 Thread Nico Williams
On Thu, Oct 05, 2023 at 03:49:37PM -0400, Tom Lane wrote:
> Nico Williams  writes:
> > Text+encoding can be just like bytea with a one- or two-byte prefix
> > indicating what codeset+encoding it's in.  That'd be how to encode
> > such text values on the wire, though on disk the column's type should
> > indicate the codeset+encoding, so no need to add a prefix to the value.
> 
> The precedent of BOMs (byte order marks) suggests strongly that
> such a solution would be horrible to use.

This is just how you encode the type of the string.  You have any number
of options.  The point is that PG can already encode binary data, so if
the question is how to encode text of disparate encodings on the wire,
building on top of the encoding of bytea is an option.




Re: Pre-proposal: unicode normalized text

2023-10-06 Thread Nico Williams
On Fri, Oct 06, 2023 at 01:33:06PM -0400, Robert Haas wrote:
> On Thu, Oct 5, 2023 at 3:15 PM Nico Williams  wrote:
> > Text+encoding can be just like bytea with a one- or two-byte prefix
> > indicating what codeset+encoding it's in.  That'd be how to encode
> > such text values on the wire, though on disk the column's type should
> > indicate the codeset+encoding, so no need to add a prefix to the value.
> 
> Well, that would be making the encoding a per-value property, rather
> than a per-column property like collation as I proposed. I can't see

On-disk it would be just a property of the type, not part of the value.

Nico
-- 




Re: Pre-proposal: unicode normalized text

2023-10-06 Thread Nico Williams
On Fri, Oct 06, 2023 at 02:17:32PM -0400, Robert Haas wrote:
> On Fri, Oct 6, 2023 at 1:38 PM Nico Williams  wrote:
> > On Fri, Oct 06, 2023 at 01:33:06PM -0400, Robert Haas wrote:
> > > On Thu, Oct 5, 2023 at 3:15 PM Nico Williams  
> > > wrote:
> > > > Text+encoding can be just like bytea with a one- or two-byte prefix
> > > > indicating what codeset+encoding it's in.  That'd be how to encode
> > > > such text values on the wire, though on disk the column's type should
> > > > indicate the codeset+encoding, so no need to add a prefix to the value.
> > >
> > > Well, that would be making the encoding a per-value property, rather
> > > than a per-column property like collation as I proposed. I can't see
> >
> > On-disk it would be just a property of the type, not part of the value.
> 
> I mean, that's not how it works.

Sure, because TEXT in PG doesn't have codeset+encoding as part of it --
it's whatever the database's encoding is.  Collation can and should be a
property of a column, since for Unicode it wouldn't be reasonable to
make that part of the type.  But codeset+encoding should really be a
property of the type if PG were to support more than one.  IMO.




Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2018-07-02 Thread Nico Williams
On Mon, Jul 02, 2018 at 06:56:34PM +0300, Alvaro Hernandez wrote:
> On 21/06/18 21:43, Nico Williams wrote:
> >Incidentally, PG w/ pgcrypto and FDW does provide everything one needs
> >to be able to implement client-side crypto:
> >
> >  - use PG w/ FDW as a client-side proxy for the real DB
> >  - use pgcrypto in VIEWs with INSTEAD OF triggers in the proxy
> >  - access the DB via the proxy
> 
> Sounds a bit hackish, but it could work. I doubt however the acceptance
> of a solution like this, given the number of "moving parts" and operational
> complexity associated with it.

Well, you could use SQLite3 instead as the client.  Like how GDA does
it.

I do wish there was a libpostgres -- a light-weight postgres for running
an in-process RDBMS in applications without having to have a separate
set of server processes.  That would work really well for a use case
like this one where you're really going to be using FDW to access the
real data store.

If your objection is to an RDBMS in the application accessing real data
via FDW, well, see all the commentary about threat models.  You really
can't protect against DBAs without client-side crypto (and lots of bad
trade-offs).  You can do the crypto in the application, but you start to
lose the power of SQL.  Anyways, I don't think client-side crypto is the
best answer to the DBA threat -- access reduction + auditing is easier
and better.

In any case, spinning up a postgres instance just for this use case is
easy because it wouldn't have any persistent state to keep locally.

> >Any discussion of cryptographic applications should start with a
> >discussion of threat models.  This is not a big hurdle.
> 
> You already mentioned that there are also "business criteria" to
> consider here, and they are important. And there are even more to consider.

The threat model *is* a business criterion.  What are the threats you
want to protect against?  Why aren't you interested in these other
threats?  These are *business* questions.

Of course, one has to understand the issues, including intangibles.  For
example, reputation risk is a huge business concern, but as an
intangible it's easy to miss.  People have a blind spot for intangibles.

> For instance, cases where (1) and even (2) under your proposed threat model
> cannot be fixed by external (device/filesystem) encryption. Consider for
> example the managed database services provided by the cloud vendors. While
> (all?) of them provide transparent disk encryption, are they trust-able? Do

Databases won't be your only cloud security issue.  At some point you're
either using things like SGX or, while you wait for practical, high-
performance homomorphic cryptographic computing, you'll settle for using
the power of contracts and contract law -- contract law is a *business*
tool.

At some point there's not much difference between an insider you can
fire and an insider at a vendor who can fire them (and which vendor you
can fire as well), especially when you can use contract law to get the
vendor to do special things for you, like show you how they do things,
reduce admin access, let you audit them, and so on.

> business want to rely on their encryption scheme, key management, and how
> they respond to requests to hand off encryption keys? I believe
> self-contained solutions are very much worth it, also because of this.

I don't, because client-side crypto to deal with this is just so very
unsatisfactory.  Besides, what happens when first you move the DB into
the cloud and put a lot of effort into client-side crypto, then you move
the clients into the same cloud too?

And anyways, what was proposed by OP is *server-side* crypto, which
clearly does not help at all in this case.

Nico
-- 



Re: [HACKERS] [PATCH] WIP Add ALWAYS DEFERRED option for constraints

2018-07-02 Thread Nico Williams
On Tue, Jun 26, 2018 at 04:54:13PM -0400, Robbie Harwood wrote:
> Nico Williams  writes:
> 
> > [Re-send; first attempt appears to have hit /dev/null somewhere.  My
> > apologies if you get two copies.]
> >
> > I've finally gotten around to rebasing this patch and making the change
> > that was requested, which was: merge the now-would-be-three deferral-
> > related bool columns in various pg_catalog tables into one char column.
> >
> > Instead of (deferrable, initdeferred, alwaysdeferred), now there is just
> > (deferral).
> 
> This design seems correct to me.  I have a couple questions inline, and
> some nits to go with them.

Thanks for the review.  I'm traveling (on vacation).  I'll try to get to
your comments within a week.  Thanks!



Re: Threat models for DB cryptography (Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS))

2018-07-02 Thread Nico Williams
On Mon, Jul 02, 2018 at 06:22:46PM +0900, Masahiko Sawada wrote:
> On Fri, Jun 22, 2018 at 2:31 PM, Tsunakawa, Takayuki
>  wrote:
> > From: Nico Williams [mailto:n...@cryptonector.com]
> >
> >> One shortcoming of relying on OS functionality for protection against
> >> malicious storage is that not all OSes may provide such functionality.
> >> This could be an argument for implementing full, transparent encryption
> >> for an entire DB in the postgres server.  Not a very compelling
> >> argument, but that's just my opinion -- reasonable people could differ
> >> on this.
> >
> > Yes, this is one reason I developed TDE in our product.  And
> > in-database encryption allows optimization by encrypting only user
> > data.

You're likely getting some things terribly wrong.  E.g., integrity
protection.  Most likely you're getting a false sense of security.

> Me too. In-database encryption is helpful in practice. I think 1) and
> 2) seem to cover the threat models which the data encryption in
> database needs to defend.

Yes, but piecemeal encryption seems like a bad idea to me.

Nico
-- 



Re: Threat models for DB cryptography (Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS))

2018-07-03 Thread Nico Williams
On Tue, Jul 03, 2018 at 07:28:42PM +0900, Masahiko Sawada wrote:
> On Tue, Jul 3, 2018 at 7:16 AM, Nico Williams  wrote:
> > Yes, but piecemeal encryption seems like a bad idea to me.
> 
> What do you mean by "piecemeal encryption"? Is it not-whole database
> encryption such as per-table or per-tablespace? If so could you please
> elaborate on the reason why you think so?

I mean that encrypting some columns only, or some tables only, has
integrity protection issues.  See earlier posts in this thread.

Encrypting the whole DB has no such problems, assuming you're doing the
crypto correctly anyways.  But for full DB encryption it's easier to
leave the crypto to the filesystem or device drivers.  (If the devices
are physically in the host and cannot be removed easily, then FDE at the
device works well too.)

Nico
-- 



Re: How can we submit code patches that implement our (pending) patents?

2018-07-09 Thread Nico Williams
On Thu, Jul 05, 2018 at 01:15:15AM +, Tsunakawa, Takayuki wrote:
> From: Craig Ringer [mailto:cr...@2ndquadrant.com]
> > I'm assuming you don't want to offer a grant that lets anyone use them for
> > anything. But if you have a really broad grant to PostgreSQL, all someone
> > would have to do to inherit the grant is re-use some part of PostgreSQL.
> 
> Your assumption is right.  No scope is the same as no patent; it won't help 
> to defend the PostgreSQL community against rival companies/communities of other 
> DBMSs.  Or, I think we can set the scope to what OIN states.  Fortunately, 
> anyone can join OIN free of charge.
> 
> 
> > I guess there's a middle ground somewhere that protects substantial
> > derivatives and extracts but stops you using some Pg code snippets as a
> > freebie license.
> 
> Are you assuming that developers want to use PG code snippets for
> non-PostgreSQL or even non-DBMS software?  I believe that accepting
> patented code from companies would be practically more useful for
> PostgreSQL enhancement and growth.  PostgreSQL is now a mature
> software, and it can be more corporate-friendly like other software
> under Apache License.

Suppose I have my own patches, not yet contributed to PG, and that I'm
using them in production.  Can I use my patched version of PG with your
functionality?

Suppose I am developing my own PostgreSQL derivative, with my own secret
sauce perhaps, and perhaps I'm using it to build a proprietary cloud DB
service.  Can I use my patched version of PG with your functionality?

I suspect your answer to the latter would be "absolutely not".

Maybe the first one would be OK if you can somehow distinguish it from
the latter?

Anyways, as a user of PG and occasional hacker of PG, I would have to
insist on a ./configure way to exclude all functionality not licensed to
me for my use cases, and I would have to insist on all such code being
very well segregated (I don't want to look at your code!), and I would
insist too on any patent-free configuration being highly tested.

If I were a core developer I would have to think long and hard about
whether I could meet user requirements and still have your code in-tree,
and I'm not sure I could do both of those.

Nico
-- 



Re: How can we submit code patches that implement our (pending) patents?

2018-07-09 Thread Nico Williams
On Sat, Jul 07, 2018 at 10:20:35AM -0700, Andres Freund wrote:
> It's entirely possible to dual license contributions and everything. Why
> are you making such aggressive statements about a, so far, apparently
> good faith engagement?

One problem is that many contributors would not want to be tainted by
knowledge of the patents in question (since that leads to triple
damages).

How would you protect contributors and core developers from tainting?

One possible answer is that you wouldn't.  But that might reduce the
size of the community, or lead to a fork.

Dual licensing does not get you out of these issues.

Nico
-- 



Re: How can we submit code patches that implement our (pending) patents?

2018-07-09 Thread Nico Williams
On Mon, Jul 09, 2018 at 08:29:08AM +, Tsunakawa, Takayuki wrote:
> > There are arguments made that TPL (and BSD, MIT etc) already includes an
> > implicit patent grant, but while a longstanding theory, it's to my
> > knowledge not legally been tested.
> 
> When we find a reasonable consensus here, I'd like to have our legal
> department write a patent grant statement, and then have it reviewed
> in this ML.  Is the above statement of Red Hat's enough?

You're acting as a go-between between your legal department and
executive directors on the one hand, and the PG community on the other.
That's not going to be an efficient conversation unless your legal
department understands open source communities and is willing to tailor
their patent grants to suit.

Also, these are *business* decisions that should be made by your
directors, not by your legal department.  So really, this should be a
discussion between an authorized director at your company and the
community, with all discussions with your legal department being
internal to your company.

Also, your legal department will almost certainly default to the most
restrictive patent grants.  That will contribute to making this
conversation unproductive.

My advice is that you relay all of this to your legal department and
your executives, and ask them to make the most liberal patent grant
possible such that the PG community can live with it.

Also, I would advise you that patents can be the kiss of death for
software technologies.  For example, in the world of cryptography, we
always look for patent-free alternatives and build them from scratch if
need be, leading to the non-use of patented algorithms/protocols in many
cases.  Your best bet is to make a grant so liberal that the only
remaining business use of your patents is defensive against legal
attacks on the holder.

(IMO software patent lifetimes should be commensurate with the effort it
takes to come up with and develop an idea for the field -- five to eight
years max, not seventeen or twenty.)

Nico
-- 



Re: How can we submit code patches that implement our (pending) patents?

2018-07-11 Thread Nico Williams
On Tue, Jul 10, 2018 at 08:20:53AM +, Tsunakawa, Takayuki wrote:
> From: Nico Williams [mailto:n...@cryptonector.com]
> > On Sat, Jul 07, 2018 at 10:20:35AM -0700, Andres Freund wrote:
> > > It's entirely possible to dual license contributions and everything. Why
> > > are you making such aggressive statements about a, so far, apparently
> > > good faith engagement?
> > 
> > One problem is that many contributors would not want to be tainted by
> > knowledge of the patents in question (since that leads to triple
> > damages).
> > 
> > How would you protect contributors and core developers from tainting?
> 
> IIUC, you are concerned about the possibility that PG developers would
> read the patent document (not the PG source code), and unconsciously
> use the patented algorithm for other software that's not related to
> PostgreSQL.  That would only be helped by not reading the patent
> document...  BTW, are you relieved the current PostgreSQL doesn't
> contain any patented code?  As far as I know, PostgreSQL development
> process doesn't have a step to check patent and copyright
> infringement.

You're proposing to include code that implements patented ideas with a
suitable patent grant.  I would be free to not read the patent, but what
if the code or documents mention the relevant patented algorithms?

If I come across something like this in the PG source code:

/* The following is covered by patents US#... and so on */

now what?

I could choose not to read it.  But what if I have to touch that code in
order to implement an unrelated feature due to some required refactoring
of internal interfaces used by your code?  Now I have to read it, and
now I'm tainted, or I must choose to abandon my project.

That is a heavy burden on the community.  The community may want to
extract a suitable patent grant to make that burden lighter.

> > One possible answer is that you wouldn't.  But that might reduce the
> > size of the community, or lead to a fork.
> 
> Yes, that's one unfortunate future, which I don't want to happen of
> course.  I believe PostgreSQL should accept patent for further
> evolution, because PostgreSQL is now a popular, influential software
> that many organizations want to join.

I don't speak for the PG community, nor the core developers.  Speaking
for myself only, I hope that you can get, and PG accepts only, the
widest possible royalty-free patent grant to the community, allowing
others to not be constrained in making derivatives of PG.

My advice is to write up a patent grant that allows all to use the
relevant patents royalty-free with a no-lawsuit covenant.  I.e., make
only defensive use of your patents.

Nico
-- 



Re: How can we submit code patches that implement our (pending) patents?

2018-07-11 Thread Nico Williams
On Tue, Jul 10, 2018 at 09:47:09AM -0400, Tom Lane wrote:
> The core team has considered this matter, and has concluded that it's
> time to establish a firm project policy that we will not accept any code
> that is known to be patent-encumbered.  The long-term legal risks and
> complications involved in doing that seem insurmountable, given the
> community's amorphous legal nature and the existing Postgres license
> wording (neither of which are open for negotiation here).  Furthermore,
> Postgres has always been very friendly to creation of closed-source
> derivatives, but it's hard to see how inclusion of patented code would
> not cause serious problems for those.  The potential benefits of
> accepting patented code just don't seem to justify trying to navigate
> these hazards.

+1.

You should probably consider accepting code that involves patents that
are in the public domain or expired by the time of release, though even
that involves some legal costs you might not want to incur.  E.g., what
if a US patent is in the public domain but a corresponding EU patent is
not?  A search could be done, naturally, but professional patent
searches are not cheap!

Nico
-- 



Re: How can we submit code patches that implement our (pending) patents?

2018-07-11 Thread Nico Williams
On Wed, Jul 11, 2018 at 01:03:44AM +, Tsunakawa, Takayuki wrote:
> From: Tom Lane [mailto:t...@sss.pgh.pa.us]
> > The core team has considered this matter, and has concluded that it's
> > time to establish a firm project policy that we will not accept any code
> > that is known to be patent-encumbered.  The long-term legal risks and
> > complications involved in doing that seem insurmountable, given the
> > community's amorphous legal nature and the existing Postgres license
> > wording (neither of which are open for negotiation here).  Furthermore,
> > Postgres has always been very friendly to creation of closed-source
> > derivatives, but it's hard to see how inclusion of patented code would
> > not cause serious problems for those.  The potential benefits of
> > accepting patented code just don't seem to justify trying to navigate
> > these hazards.
> 
> That decision is very unfortunate as a corporate employee on one hand,
> but the firm cleanness looks a bit bright to me as one developer.
> 
> As a practical matter, when and where are you planning to post the
> project policy?  How would you check and prevent patented code?

PG may need a contributor agreement.  All I can find on that is:

https://www.postgresql.org/message-id/flat/54337476.3040605%402ndquadrant.com#9b1968ddc0fadfe225001adc1a821925

Nico
-- 



Re: [HACKERS] [PATCH] WIP Add ALWAYS DEFERRED option for constraints

2018-07-11 Thread Nico Williams
On Tue, Jun 26, 2018 at 04:54:13PM -0400, Robbie Harwood wrote:
> Nico Williams  writes:
> 
> > [Re-send; first attempt appears to have hit /dev/null somewhere.  My
> > apologies if you get two copies.]
> >
> > I've finally gotten around to rebasing this patch and making the change
> > that was requested, which was: merge the now-would-be-three deferral-
> > related bool columns in various pg_catalog tables into one char column.
> >
> > Instead of (deferrable, initdeferred, alwaysdeferred), now there is just
> > (deferral).
> 
> This design seems correct to me.  I have a couple questions inline, and
> some nits to go with them.

Thanks.  Replies below.

> > All tests (make check) pass.
> 
> Thank you for adding tests!

Well, yeah :)

> > diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
> > index 3ed9021..e82e39b 100644
> > --- a/doc/src/sgml/catalogs.sgml
> > +++ b/doc/src/sgml/catalogs.sgml
> > @@ -2239,17 +2239,15 @@ SCRAM-SHA-256$<iteration 
> > count>:&l
> >   
> >  
> >   
> > -  condeferrable
> > -  bool
> > -  
> > -  Is the constraint deferrable?
> > - 
> > -
> > - 
> > -  condeferred
> > -  bool
> > +  condeferral
> > +  char
> >
> > -  Is the constraint deferred by default?
> > +  Constraint deferral option:
> > +a = always deferred,
> > +d = deferrable,
> > +d = deferrable initially deferred,
> 
> From the rest of the code, I think this is supposed to be 'i', not 'd'.

Oh yes, good catch.

> > +n = not deferrable
> > +  
> >   
> >  
> >   
> 
> > diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
> > index 8b276bc..795a7a9 100644
> > --- a/src/backend/catalog/index.c
> > +++ b/src/backend/catalog/index.c
> > @@ -1070,6 +1070,7 @@ index_create(Relation heapRelation,
> >  
> > recordDependencyOn(&myself, &referenced, 
> > deptype);
> > }
> > +   Assert((flags & INDEX_CREATE_ADD_CONSTRAINT) == 0);
> 
> What does this ensure, and why is it in this part of the code?  We're in
> the `else` branch here - isn't this always the case (i.e., Assert can
> never fire), since `flags` isn't manipulated in this block?

Oy, nothing.  I think that's a cut-n-paste error.

I'll remove it.

> > }
> >  
> > /* Store dependency on parent index, if any */
> 
> > diff --git a/src/backend/catalog/information_schema.sql 
> > b/src/backend/catalog/information_schema.sql
> > index f4e69f4..bde6199 100644
> > --- a/src/backend/catalog/information_schema.sql
> > +++ b/src/backend/catalog/information_schema.sql
> > @@ -891,10 +891,14 @@ CREATE VIEW domain_constraints AS
> > CAST(current_database() AS sql_identifier) AS domain_catalog,
> > CAST(n.nspname AS sql_identifier) AS domain_schema,
> > CAST(t.typname AS sql_identifier) AS domain_name,
> > -   CAST(CASE WHEN condeferrable THEN 'YES' ELSE 'NO' END
> > +   CAST(CASE WHEN condeferral = 'n' THEN 'NO' ELSE 'YES' END
> >   AS yes_or_no) AS is_deferrable,
> > -   CAST(CASE WHEN condeferred THEN 'YES' ELSE 'NO' END
> > +   CAST(CASE WHEN condeferral = 'i' OR condeferral = 'a' THEN 
> > 'YES' ELSE 'NO' END
> >   AS yes_or_no) AS initially_deferred
> > +  /*
> > +   * XXX Can we add is_always_deferred here?  Are there
> > +   * standards considerations?
> > +   */
> 
> I'm not familiar enough to speak to this.  Hopefully someone else can.
> Absent other input, I think we should add is_always_deferred.  (And if
> we were building this today, probably just expose a single character
> instead of three booleans.)

I had found the answer ("yes") in the history of this file but forgot to
remove the comment while rebasing.  I'll remove the comment.

> >  FROM pg_namespace rs, pg_namespace n, pg_constraint con, pg_type t
> >  WHERE rs.oid = con.connamespace
> >AND n.oid = t.typnamespace
> 
> > diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
> > index 57519fe..41dc6a4 100644
> > --- a/src/backend/commands/trigger.c
> > +++ b/src/backend/

Re: [HACKERS] [PATCH] WIP Add ALWAYS DEFERRED option for constraints

2018-07-11 Thread Nico Williams
On Wed, Jul 11, 2018 at 03:13:30PM -0400, Alvaro Herrera wrote:
> On 2018-Jun-06, Nico Williams wrote:
> > I've finally gotten around to rebasing this patch and making the change
> > that was requested, which was: merge the now-would-be-three deferral-
> > related bool columns in various pg_catalog tables into one char column.
> > 
> > Instead of (deferrable, initdeferred, alwaysdeferred), now there is just
> > (deferral).
> 
> Nice stuff.
> 
> Please add #defines for the chars in pg_constraint.h instead of using
> the values directly in the source.  Also, catalogs.sgml has a typo in
> one of the values.
> 
> What happens if you do SET CONSTRAINTS ALL IMMEDIATE and you have one of
> these constraints?  I expect that it silently does not alter that
> constraint, but you didn't change that manpage.

Correct, that's the idea, that it should be possible to write deferred
constraints/triggers which users cannot make immediate.  For example, an
audit extension that logs changes via FOR EACH ROW ALWAYS DEFERRED, or
my PoC COMMIT TRIGGER implementation (which depends on deferred
triggers).
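
For example, with the patch's proposed syntax (a sketch -- audit_row()
is a hypothetical logging function):

  -- Queues an audit row that fires at commit time; SET CONSTRAINTS
  -- ALL IMMEDIATE cannot force it to fire earlier.
  CREATE CONSTRAINT TRIGGER audit_t
      AFTER INSERT OR UPDATE OR DELETE ON t
      ALWAYS DEFERRED
      FOR EACH ROW EXECUTE PROCEDURE audit_row();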

I missed that there is a page for SET CONSTRAINTS!  I'll update it.

> I suggest not to include psql/tab-complete changes with your main patch.
> If you want to update it, split it out to its own patch.  Committer can
> choose to include it in one commit or not (I'm mildly inclined not to,
> but I'm probably inconsistent about it), but for review IMO it's better
> not to mix things -- It's way too easy to get entangled in silly details
> while editing that code, and it's not critical anyway.

OK, sure, though, of course, the committer could always leave that out
themselves, no?

To me though it seems that the change should be complete.

> > Incidentally, I had to do commit-by-commit rebasing to make the rebase
> > easier.  I have a shell function I use for this, if anyone wants a copy
> > of it -- sometimes it's much easier to do this than to do one huge jump.
> 
> I've done this manually a few times.  Please share, I'm curious.

OK, attached is my latest version of that script, though this one is a
bit changed from the one I used.  This version tries to be faster / more
efficient by first doing 1 commit, then 2, then 3, and so on, and on
conflict it aborts and halves N to try again.  The idea is to only have
to merge conflicts at each commit where conflicts arise, never resolving
conflicts across more than one commit -- this makes it much easier to
reason about conflicts!

Note that the script is actually a shell function, and that it keeps
state in shell variables.  A better implementation would do the sort of
thing that git(1) itself does to keep rebase state.

Nico
-- 
# Based on a shell function by Viktor Dukhovni
#
# slowrebase BRANCH_TO_REBASE ONTO
function slowrebase {
typeset b N

if (($# > 0)) && [[ $1 = -h || $1 = --help ]]; then
printf 'Usage: slowrebase BRANCH_TO_REBASE ONTO_HEAD\n'
printf '   slowrebase # to continue after resolving conflicts\n'
printf '\n\tslowrebase is a shell function that uses the following\n'
printf '\tglobal variables to keep state: $S $T $B ${C[@]}\n'
printf '\t$slowrebase_window_sz\n'
printf '\tDO NOT CHANGE THOSE VARIABLES.\n'
return 0
elif (($# > 0 && $# != 2)); then
printf 'Usage: slowrebase BRANCH_TO_REBASE ONTO_HEAD\n' 1>&2
printf '   slowrebase # to continue after resolving conflicts\n'
printf '\n\tslowrebase is a shell function that uses the following\n' 1>&2
printf '\tglobal variables to keep state: $S $T $B ${C[@]}\n' 1>&2
printf '\t$slowrebase_window_sz\n' 1>&2
printf '\tDO NOT CHANGE THOSE VARIABLES.\n' 1>&2
return 1
fi

if (($# == 2)); then
slowrebase_window_sz=1
S=$1
T=$2
B=$(git merge-base "$S" "$T")
C=( $(git log --oneline "$B".."$2" | awk '{print $1}') )
set --

# Prep
git log -n1 "$S" > /dev/null || return 1
if [[ $(git log --oneline -n1 HEAD) != $(git log --oneline -n1 "$S") ]]; then
if (($(git status -sb | wc -l) != 1)); then
printf 'Error: please clean your workspace\n'
return 1
fi
git checkout "$S"
elif (($(git status -sbuno | wc -l) != 1)); then
printf 'Error: please clean your workspace\n'
return 1
fi

# Fall through to get started
e

Re: How can we submit code patches that implement our (pending) patents?

2018-07-11 Thread Nico Williams
On Thu, Jul 12, 2018 at 12:29:12AM +, Tsunakawa, Takayuki wrote:
> From: Nico Williams [mailto:n...@cryptonector.com]
> > My advice is to write up a patent grant that allows all to use the
> > relevant patents royalty-free with a no-lawsuit covenant.  I.e., make
> > only defensive use of your patents.
> 
> How can one make defensive use of his patent if he allows everyone to
> use it royalty-free?  Can he use his patent for cross-licensing
> negotiation if some commercial database vendor accuses his company
> that PostgreSQL unintentionally infringes that vendor's patent?

https://en.wikipedia.org/wiki/Defensive_termination



Re: [HACKERS] [PATCH] WIP Add ALWAYS DEFERRED option for constraints

2018-07-11 Thread Nico Williams
On Wed, Jul 11, 2018 at 01:41:12PM -0500, Nico Williams wrote:
> > > @@ -5538,17 +5568,24 @@ ConstraintAttributeSpec:
> > >   int newspec = $1 | $2;
> > >
> > >   /* special message for this case */
> > > - if ((newspec & (CAS_NOT_DEFERRABLE | CAS_INITIALLY_DEFERRED)) == (CAS_NOT_DEFERRABLE | CAS_INITIALLY_DEFERRED))
> > > + if ((newspec & CAS_NOT_DEFERRABLE) &&
> > > +     (newspec & (CAS_INITIALLY_DEFERRED | CAS_ALWAYS_DEFERRED)))
> > >   ereport(ERROR,
> > >           (errcode(ERRCODE_SYNTAX_ERROR),
> > >            errmsg("constraint declared INITIALLY DEFERRED must be DEFERRABLE"),
> > >            parser_errposition(@2)));
> > >   /* generic message for other conflicts */
> > > + if ((newspec & CAS_ALWAYS_DEFERRED) &&
> > > +     (newspec & (CAS_INITIALLY_IMMEDIATE)))
> > > +  ereport(ERROR,
> > > +          (errcode(ERRCODE_SYNTAX_ERROR),
> > > +           errmsg("conflicting constraint properties 1"),
> > > +           parser_errposition(@2)));
> > >   if ((newspec & (CAS_NOT_DEFERRABLE | CAS_DEFERRABLE)) == (CAS_NOT_DEFERRABLE | CAS_DEFERRABLE) ||
> > >       (newspec & (CAS_INITIALLY_IMMEDIATE | CAS_INITIALLY_DEFERRED)) == (CAS_INITIALLY_IMMEDIATE | CAS_INITIALLY_DEFERRED))
> > >   ereport(ERROR,
> > >           (errcode(ERRCODE_SYNTAX_ERROR),
> > > -          errmsg("conflicting constraint properties"),
> > > +          errmsg("conflicting constraint properties 2"),
> > 
> > I'd prefer you just repeat the message (or make them more situationally
> > descriptive), rather than appending a number.  (Repeating error messages
> > is in keeping with the style here.)
> 
> Oy, I forgot about these.  The number was a bread crumb I forgot to
> clean up :(  So sorry about that.

So, I'm tempted not to add new messages that will require translation,
leaving these as "conflicting constraint properties".

But I'm perfectly happy to make these more informative too if that's
desired.  I just don't know what the expectations are regarding message
catalog translation.  Any advice?

Nico
-- 



Re: [HACKERS] [PATCH] WIP Add ALWAYS DEFERRED option for constraints

2018-07-11 Thread Nico Williams
Attached is an additional patch, as well as a new, rebased patch.

This includes changes responsive to Álvaro Herrera's commentary about
the SET CONSTRAINTS manual page.

Nico
-- 
From e7838b60dbf0a8cd7f35591db2f9aab78d8903cb Mon Sep 17 00:00:00 2001
From: Nicolas Williams 
Date: Wed, 11 Jul 2018 19:53:01 -0500
Subject: [PATCH] Add ALWAYS DEFERRED option for CONSTRAINTs (CR)

---
 doc/src/sgml/catalogs.sgml |  2 +-
 doc/src/sgml/ref/set_constraints.sgml  | 10 ++
 src/backend/catalog/index.c|  1 -
 src/backend/catalog/information_schema.sql |  8 +++-
 src/backend/commands/trigger.c |  1 -
 src/backend/parser/gram.y  | 18 ++
 6 files changed, 20 insertions(+), 20 deletions(-)

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 291e6a9..4d42594 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -2245,7 +2245,7 @@ SCRAM-SHA-256$:&l
   Constraint deferral option:
 a = always deferred,
 d = deferrable,
-d = deferrable initially deferred,
+i = deferrable initially deferred,
 n = not deferrable
   
  
diff --git a/doc/src/sgml/ref/set_constraints.sgml b/doc/src/sgml/ref/set_constraints.sgml
index 671332a..390015e 100644
--- a/doc/src/sgml/ref/set_constraints.sgml
+++ b/doc/src/sgml/ref/set_constraints.sgml
@@ -34,11 +34,13 @@ SET CONSTRAINTS { ALL | name [, ...
   
 
   
-   Upon creation, a constraint is given one of three
+   Upon creation, a constraint is given one of four
characteristics: DEFERRABLE INITIALLY DEFERRED,
-   DEFERRABLE INITIALLY IMMEDIATE, or
-   NOT DEFERRABLE. The third
-   class is always IMMEDIATE and is not affected by the
+   DEFERRABLE INITIALLY IMMEDIATE,
+   NOT DEFERRABLE, or ALWAYS DEFERRED.
+   The third
+   class is always IMMEDIATE, while the fourth class is
+   always DEFERRED, and neither is affected by the
SET CONSTRAINTS command.  The first two classes start
every transaction in the indicated mode, but their behavior can be changed
within a transaction by SET CONSTRAINTS.
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 795a7a9..45b52b4 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1070,7 +1070,6 @@ index_create(Relation heapRelation,
 
 recordDependencyOn(&myself, &referenced, deptype);
 			}
-			Assert((flags & INDEX_CREATE_ADD_CONSTRAINT) == 0);
 		}
 
 		/* Store dependency on parent index, if any */
diff --git a/src/backend/catalog/information_schema.sql b/src/backend/catalog/information_schema.sql
index bde6199..dd4792a 100644
--- a/src/backend/catalog/information_schema.sql
+++ b/src/backend/catalog/information_schema.sql
@@ -894,11 +894,9 @@ CREATE VIEW domain_constraints AS
CAST(CASE WHEN condeferral = 'n' THEN 'NO' ELSE 'YES' END
  AS yes_or_no) AS is_deferrable,
CAST(CASE WHEN condeferral = 'i' OR condeferral = 'a' THEN 'YES' ELSE 'NO' END
- AS yes_or_no) AS initially_deferred
-	   /*
-	* XXX Can we add is_always_deferred here?  Are there
-	* standards considerations?
-	*/
+ AS yes_or_no) AS initially_deferred,
+   CAST(CASE WHEN condeferral = 'a' THEN 'YES' ELSE 'NO' END
+ AS yes_or_no) AS always_deferred
 FROM pg_namespace rs, pg_namespace n, pg_constraint con, pg_type t
 WHERE rs.oid = con.connamespace
   AND n.oid = t.typnamespace
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index 41dc6a4..33b1095 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -3627,7 +3627,6 @@ typedef struct AfterTriggerSharedData
 	TriggerEvent ats_event;		/* event type indicator, see trigger.h */
 	Oid			ats_tgoid;		/* the trigger's ID */
 	Oid			ats_relid;		/* the relation it's on */
-	bool			ats_alwaysdeferred;	/* whether this can be deferred */
 	CommandId	ats_firing_id;	/* ID for firing cycle */
 	struct AfterTriggersTableData *ats_table;	/* transition table access */
 } AfterTriggerSharedData;
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index dab721a..9aaa2af 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -185,8 +185,8 @@ static void SplitColQualList(List *qualList,
 			 List **constraintList, CollateClause **collClause,
 			 core_yyscan_t yyscanner);
 static void processCASbits(int cas_bits, int location, const char *constrType,
-			   char *deferral,
-			   bool *not_valid, bool *no_inherit, core_yyscan_t yyscanner);
+			   char *deferral, bool *not_valid, bool *no_inherit,
+			   core_yyscan_t yyscanner);
 static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 
 %}
@@ -5579,13 +5579,13 @@ ConstraintAttributeSpec:
 		(newspec & (CAS_INITIALLY_IMMEDIATE)))
 		ereport(ERROR,
 (errcode(ERRCODE_SYNTAX_ERROR),

Re: How can we submit code patches that implement our (pending) patents?

2018-07-11 Thread Nico Williams
On Thu, Jul 12, 2018 at 01:10:33AM +, Tsunakawa, Takayuki wrote:
> From: Nico Williams [mailto:n...@cryptonector.com]
> > On Thu, Jul 12, 2018 at 12:29:12AM +, Tsunakawa, Takayuki wrote:
> > > How can one make defensive use of his patent if he allows everyone to
> > > use it royalty-free?  Can he use his patent for cross-licensing
> > > negotiation if some commercial database vendor accuses his company
> > > that PostgreSQL unintentionally infringes that vendor's patent?
> > 
> > https://en.wikipedia.org/wiki/Defensive_termination
> 
> Thank you, this looks reasonable to give everyone the grant.  Then, I
> wonder if the community can accept patented code if the patent holder
> grants this kind of license.

I'm not a core member, but personally I'd be inclined to accept a
royalty-free, non-exclusive, non-expiring, transferrable, ..., grant
where the only condition is to not sue the patent holder.

I don't object to patents in general, but I think patent lifetimes are
way too long for software (because they are not commensurate with the
costs of developing software), and I understand that often one must
obtain patents for *defensive* purposes because there are sharks out
there.  So I'm inclined to accept patents with defensive but otherwise
very wide grants.

Nico
-- 



Re: How can we submit code patches that implement our (pending) patents?

2018-07-11 Thread Nico Williams
On Thu, Jul 12, 2018 at 09:33:21AM +0800, Craig Ringer wrote:
> On 12 July 2018 at 09:10, Tsunakawa, Takayuki <
> tsunakawa.ta...@jp.fujitsu.com> wrote:
> > From: Nico Williams [mailto:n...@cryptonector.com]
> > > On Thu, Jul 12, 2018 at 12:29:12AM +, Tsunakawa, Takayuki wrote:
> > > > How can one make defensive use of his patent if he allows everyone to
> > > > use it royalty-free?  Can he use his patent for cross-licensing
> > > > negotiation if some commercial database vendor accuses his company
> > > > that PostgreSQL unintentionally infringes that vendor's patent?
> > >
> > > https://en.wikipedia.org/wiki/Defensive_termination
> >
> > Thank you, this looks reasonable to give everyone the grant.  Then, I
> > wonder if the community can accept patented code if the patent holder
> > grants this kind of license.
> 
> Personally, I'd be in favour, but the core team has spoken here. I'm not
> privy to the discussion but the outcome seems firm.

Has the issue been considered in enough detail?

A thorough treatment of the issue would probably mean that PG should
have a contributor agreement, or at the very least the license that PG
uses should include generous royalty-free patent grants so as to force
contributors to make them.  Of course, any time you add contributor
agreements you add friction to the contribution process, and at least
one-time legal costs for the core.

Also, what about patents that are placed in the public domain?  Surely
code implementing those would not be rejected for involving patents...

I'm not sure that the widest grant with no-lawsuit defensive clauses
should be accepted, but it's not that far from the public domain case.
Even here there's legal costs: at least a one-time cost of hiring an IP
lawyer to write up patent grant language that the PG community would
insist on.  It could be well worth it.

Nico
-- 



Re: [HACKERS] possible self-deadlock window after bad ProcessStartupPacket

2018-07-19 Thread Nico Williams
On Thu, Jul 19, 2018 at 12:20:53PM -0700, Andres Freund wrote:
> On 2018-07-19 11:57:25 +0300, Heikki Linnakangas wrote:
> > Ugh. Yeah, in wal_quickdie, and other aux process *_quickdie() handlers, I
> > agree we should just _exit(2). All we want to do is to exit the process
> > immediately.
> 
> Indeed.
> 
> > bgworker_quickdie() makes some effort to block SIGQUIT during the exit()
> > processing, but that doesn't solve the whole problem. The process could've
> > been in the middle of a malloc/free when the signal arrived, for example.
> > exit() is simply not safe to call from a signal handler.
> 
> Yea. I really don't understand why we have't been able to agree on this
> for *years* now.

I mean, only calling async-signal-safe functions from signal handlers is
a critical POSIX requirement, and exit(3) is NOT async-signal-safe.

> > The regular backend's quickdie() function is more tricky. It should also
> > call _exit(2) rather than exit(2). But it also tries to ereport a WARNING,
> > and that is quite useful.
> 
> Is that actually true? Clients like libpq create the same error message
> (which has its own issues, because it'll sometimes mis-interpret
> things). The message doesn't actually have useful content, no?

Is ereport() async-signal-safe?  No.  It could be, but it uses stdio to
write to stderr/console, and stdio is not async-signal-safe.

Nico
-- 



Re: [HACKERS] possible self-deadlock window after bad ProcessStartupPacket

2018-07-19 Thread Nico Williams
On Thu, Jul 19, 2018 at 03:49:35PM -0400, Tom Lane wrote:
> Andres Freund  writes:
> > On 2018-07-19 11:57:25 +0300, Heikki Linnakangas wrote:
> >> The regular backend's quickdie() function is more tricky. It should also
> >> call _exit(2) rather than exit(2). But it also tries to ereport a WARNING,
> >> and that is quite useful.
> 
> There's already an on_exit_reset in there; why do we need more than that?

You're not allowed to call exit(3) in signal handlers.  exit(3) is not
async-signal-safe (for example, it calls atexit handlers, flushes stdio,
and who knows what else).

> > Is that actually true? Clients like libpq create the same error message
> > (which has its own issues, because it'll sometimes mis-interpret
> > things). The message doesn't actually have useful content, no?
> 
> Yes, it does: it lets users tell the difference between exit due to a
> SIGQUIT and a crash of their own backend.

ereport() calls (via errstart() and write_stderr()) vfprintf() and
fflush(stderr).  That's absolutely not async-signal-safe.  The WIN32
code in write_stderr() does use vsnprintf() instead, and that could get
written to stderr with write(2), which is async-signal-safe.

> Admittedly, if we crash trying to send the message, then we're not
> better off.  But since that happens only very rarely, I do not think
> it's a reasonable tradeoff to never send it at all.

If sending it means using OpenSSL or what have you, that's probably even
more code that is not async-signal-safe.

Now, ereport() could all be made safe enough for use in signal handlers,
but it isn't at this time.

What I'd do is have a volatile sig_atomic_t in_signal_handler_context
variable to indicate that we're dying, and then when that is non-zero,
ereport() and friends could use all-async-signal-safe codepaths.
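
Something like this minimal sketch (hypothetical names; not PG's actual
ereport() internals):

    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    static volatile sig_atomic_t in_signal_handler_context = 0;

    static void
    safe_report(const char *msg)
    {
        if (in_signal_handler_context) {
            /* async-signal-safe path: a single write(2); no stdio, no malloc() */
            (void) write(STDERR_FILENO, msg, strlen(msg));
        } else {
            /* normal path: free to use stdio, allocation, translation, etc. */
            (void) fprintf(stderr, "%s", msg);
        }
    }

    static void
    quickdie_handler(int signo)
    {
        (void) signo;
        in_signal_handler_context = 1;
        safe_report("terminating connection\n");
        _exit(2);
    }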

Nico
-- 



Re: [HACKERS] possible self-deadlock window after bad ProcessStartupPacket

2018-07-19 Thread Nico Williams
On Thu, Jun 22, 2017 at 03:10:31PM -0400, Tom Lane wrote:
> Andres Freund  writes:
> > Or, probably more robust: Simply _exit(2) without further ado, and rely
> > on postmaster to output an appropriate error message. Arguably it's not
> > actually useful to see hundreds of "WARNING: terminating connection because 
> > of
> > crash of another server process" messages in the log anyway.
> 
> At that point you might as well skip the entire mechanism and go straight
> to SIGKILL.  IMO the only reason quickdie() exists at all is to try to
> send a helpful message to the client.  And it is helpful --- your attempt
> to claim that it isn't sounds very much like wishful thinking.

I dunno if it is or isn't helpful.  But I do know that this must be done
in an async-signal-safe way.

Besides making ereport() async-signal-safe, which is tricky, you could
write(2) the arguments to a pipe that another thread in the same process
is reading from and which will then call ereport() and exit(3).  This
would be less work if you're willing to use a thread for that (the
thread would only block in read(2) on that pipe, and would only provide
this one service).

Nico
-- 



Re: [HACKERS] possible self-deadlock window after bad ProcessStartupPacket

2018-07-19 Thread Nico Williams
On Thu, Jul 19, 2018 at 04:04:01PM -0400, Tom Lane wrote:
> Nico Williams  writes:
> > What I'd do is have a volatile sig_atomic_t in_signal_handler_context
> > variable to indicate that we're dying, and then when that is non-zero,
> > ereport() and friends could use all-async-signal-safe codepaths.
> 
> I eagerly await your patch with an async-safe implementation of ereport...

Once my ALWAYS DEFERRED patch is done, sure :)

Also, see my other post just now where I propose a thread-based
mechanism for making quickdie() async-signal-safe.  Basically, there
would be a pipe and a thread that only blocks in read(2) on that pipe,
and quickdie() would write to the pipe the relevant details, then that
thread would call ereport() and then exit(3).  This would be much much
simpler to implement than making ereport() async-signal-safe.

Nico
-- 



Re: [HACKERS] possible self-deadlock window after bad ProcessStartupPacket

2018-07-19 Thread Nico Williams
On Thu, Jul 19, 2018 at 01:10:14PM -0700, Andres Freund wrote:
> On 2018-07-19 15:04:15 -0500, Nico Williams wrote:
> > Besides making ereport() async-signal-safe, which is tricky, you could
> > write(2) the arguments to a pipe that another thread in the same process
> > is reading from and which will then call ereport() and exit(3).  This
> > would be less work if you're willing to use a thread for that (the
> > thread would only block in read(2) on that pipe, and would only provide
> > this one service).
> 
> It'd also increase memory usage noticably (we'd have twice the process
> count in the kernel, would have a lot of additional stacks etc), would
> tie us to supporting threading in the backend, ...  This is a DOA
> approach imo.

You can create that thread with a really small stack given that its only
purpose is to do this error reporting and exit.

Running a thread that does only this does not impact the rest of the
code in the backend at all -- it's not "threading" the backend.  When it
gets invoked, the caller would be blocking / sleeping, waiting for the
coming exit, while this helper thread would block until invoked.  It's
really not a big deal.

I use this technique in some of my programs (unfortunately none in my
github repos).  Usually I use it for detection of parent process death
(so that if the parent dies silently, the children die too).  In that
case the child-side of fork() closes the write end of a pipe and starts
a thread that blocks in read(2) on the read end of the pipe, and exit()s
when the read returns anything other than EINTR.
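
A rough sketch of that trick (simplified; the parent calls pipe(2)
before fork() and keeps the write end open; error handling elided):

    #include <errno.h>
    #include <pthread.h>
    #include <unistd.h>

    static int death_pipe[2];   /* pipe(death_pipe) done in the parent, pre-fork */

    static void *
    parent_watcher(void *arg)
    {
        char c;
        ssize_t n;

        (void) arg;
        /* read(2) returns 0 (EOF) once the parent dies and its end closes */
        do {
            n = read(death_pipe[0], &c, sizeof(c));
        } while (n == -1 && errno == EINTR);
        _exit(1);
        return NULL;
    }

    /* Child side of fork(): close the write end, start the watcher thread */
    void
    setup_parent_death_watch(void)
    {
        pthread_t tid;

        (void) close(death_pipe[1]);
        (void) pthread_create(&tid, NULL, parent_watcher, NULL);
    }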

Nico
-- 



Re: [HACKERS] possible self-deadlock window after bad ProcessStartupPacket

2018-07-19 Thread Nico Williams
On Thu, Jul 19, 2018 at 04:16:31PM -0400, Tom Lane wrote:
> Nico Williams  writes:
> > I dunno if it is or isn't helpful.  But I do know that this must be done
> > in an async-signal-safe way.
> 
> I haven't actually heard a convincing reason why that's true.  As per

It's undefined behavior.  The system is free to do all sorts of terrible
things.  In practice deadlock or crash are the things you'll see, as you
say.

> the previous discussion, if we happen to service the SIGQUIT at an
> unfortunate moment, we might get a deadlock or crash in the backend
> process, and thereby fail to send the message.  But we're no worse
> off in such cases than if we'd not tried to send it at all.  The only
> likely penalty is that, in the deadlock case, a few seconds will elapse
> before the postmaster runs out of patience and sends SIGKILL.
> 
> Yeah, it'd be nicer if we weren't taking such risks, but the amount
> of effort required to get there seems very far out of proportion to
> the benefit.
> 
> > Besides making ereport() async-signal-safe, which is tricky, you could
> > write(2) the arguments to a pipe that another thread in the same process
> > is reading from and which will then call ereport() and exit(3).
> 
> We have not yet accepted any patch that introduces threading into the
> core backend, and this seems like a particularly unlikely place to
> start.  Color me dubious, as well, that thread 2 calling exit() (never
> mind ereport()) while the main thread continues to do $whatever offers
> any actual gain in safety.

No, the other thread does NOT continue to do whatever -- it
blocks/sleeps forever waiting for the coming exit(3).

I.e., quickdie() would look like this:

    /* Marshal info into an automatic */
    struct quickdie_info info, *infop;

    info.this = that;
    ...
    infop = &info;

    /* Tell the death thread to report and exit */
    /* XXX actually something like net_write(), to handle EINTR */
    write(quickdie_pipe[1], &infop, sizeof(infop));

    /* Wait forever */
    for (;;) usleep(FOREVER);

and the thread would basically do:

    struct quickdie_info *infop;

    /* Wait forever for a quickdie() to happen on the main thread */
    /* XXX Actually something like net_read(), to handle EINTR */
    read(quickdie_pipe[0], &infop, sizeof(infop));

    ereport(infop->...);

    exit(1);

This use of threads does not require any changes to the rest of the
codebase.

Nico
-- 



Re: [HACKERS] possible self-deadlock window after bad ProcessStartupPacket

2018-07-19 Thread Nico Williams
On Thu, Jul 19, 2018 at 01:35:02PM -0700, Andres Freund wrote:
> On 2018-07-19 15:17:26 -0500, Nico Williams wrote:
> > You can create that thread with a really small stack given that its only
> > purpose is to do this error reporting and exit.
> 
> You still have a full kernel process backing it, which is *FAR* from
> free. [...]

It's not that big.  Its stack is small.
> [...] And we'd need enough infrastructure to set up threads with small
> stacks on a number of platforms.

Portability is actually the problem here, yes.  But you could enable the
async-signal-safe thread path in some platforms and not others, and
you'd still be improving things.

> > Running a thread that does only this does not impact the rest of the
> > code in the backend at all -- it's not "threading" the backend.
> 
> It actually does. Without passing thread related flags to the compiler,
> which you need to do for correctness, the compiler and libraries are
> free to use faster non-threadsafe implementations.

That's build goop, but there's no need to add mutexes or whatever
elsewhere.

Nico
-- 



Re: [HACKERS] possible self-deadlock window after bad ProcessStartupPacket

2018-07-19 Thread Nico Williams
On Thu, Jul 19, 2018 at 01:38:52PM -0700, Andres Freund wrote:
> On 2018-07-19 15:27:06 -0500, Nico Williams wrote:
> > No, the other thread does NOT continue to do whatever -- it
> > blocks/sleeps forever waiting for the coming exit(3).
> > 
> > I.e., quickdie() would look like this:
> > 
> > [...]
> > 
> > and the thread would basically do:
> > 
> > [...]
> > 
> > This use of threads does not require any changes to the rest of the
> > codebase.
> 
> Uhm, this'd already require a fair bit of threadsafety. Like at least
> all of the memory allocator / context code.  Nor is having threads
> around unproblematic for subprocesses that are forked off.  Nor does
> this account for the portability work.

Yes, but that's in libc.  None of that is in the PG code itself.



Re: [HACKERS] possible self-deadlock window after bad ProcessStartupPacket

2018-07-19 Thread Nico Williams
On Thu, Jul 19, 2018 at 03:42:46PM -0500, Nico Williams wrote:
> On Thu, Jul 19, 2018 at 01:38:52PM -0700, Andres Freund wrote:
> > Uhm, this'd already require a fair bit of threadsafety. Like at least
> > all of the memory allocator / context code.  Nor is having threads
> > around unproblematic for subprocesses that are forked off.  Nor does
> > this account for the portability work.
> 
> Yes, but that's in libc.  None of that is in the PG code itself.

Hmm, it would have perf impact, yes.  Could the postmaster keep a pipe
to all the backend processes and do reporting for them?



Re: [HACKERS] possible self-deadlock window after bad ProcessStartupPacket

2018-07-19 Thread Nico Williams
On Thu, Jul 19, 2018 at 01:46:12PM -0700, Andres Freund wrote:
> On 2018-07-19 15:42:46 -0500, Nico Williams wrote:
> > Yes, but that's in libc.  None of that is in the PG code itself.
> 
> That's simply entirely completely wrong. PG has a good chunk of memory
> management layers (facilitating memory contexts) ontop of malloc. And
> that's stateful.

Ah, Ok, I see.  Fine.



Re: Add constraint in a Materialized View

2018-07-20 Thread Nico Williams
On Wed, Jul 18, 2018 at 09:28:19AM +0200, Kaye Ann Ignacio wrote:
> I'm trying to add a foreign constraint in my local table to reference a
> column in a materialized view, is it possible to alter this materialized
> view by adding a primary key constraint?

It's not, but I'm working on a patch for that and much more.  Basically,
I want to integrate something like
https://github.com/twosigma/postgresql-contrib/blob/master/mat_views.sql

I currently use that code instead of PG native materialized views.

Nico
-- 



Re: How can we submit code patches that implement our (pending) patents?

2018-07-23 Thread Nico Williams
On Mon, Jul 23, 2018 at 09:56:47AM -0400, Bruce Momjian wrote:
> On Mon, Jul 23, 2018 at 06:31:14AM -0700, Andres Freund wrote:
> > It explicitly says irrevocable and successors. Why seems squarely
> > aimed at your concern. Bankruptcy wouldn't just invalidate that.
> 
> They can say whatever they want, but if they are bankrupt, what they say
> doesn't matter much.  My guess is that they would have to give their
> patents to some legal entity that owns them so it is shielded from
> bankrupcy.

Can you explain how a new owner could invalidate/revoke previous
irrevocable grants?

That's not rhetorical.  I want to know if that's possible.

Perhaps patent law [in some countries] requires contracts as opposed to
licenses?

Nico
-- 



Re: How can we submit code patches that implement our (pending) patents?

2018-07-23 Thread Nico Williams
On Mon, Jul 23, 2018 at 10:13:48AM -0400, Chapman Flack wrote:
> On 07/23/2018 10:01 AM, Bruce Momjian wrote:
> 
> > And the larger question is whether a patent free for use by software
> > under any license can be used in a defensive way.  If not, it means we
> > have no way forward here.
> 
> Isn't 'defensive', in patent-speak, used to mean 'establishing prior
> art usable to challenge future patent claims by others on the same
> technique'?

Not prior art.  You don't need patents to establish prior art.  Just
publish your idea and you're done.  (The problem with just publishing an
idea is that someone might have a patent application regarding a similar
idea and might modify their application to cover yours once they see it.
Later, when you go claim that you have prior art, there will be a heavy
burden of presumption on you.  This, of course, encourages you to apply
for a patent...)

Patents let you have counter-suit options should you get sued for patent
infringement -- that's the "defensive" in defensive patents.  You need a
large portfolio of patents, and this doesn't work against patent trolls.

There's also no-lawsuit covenants in grants.  You get to use my patent
if you don't sue me for using yours, and if you sue me then my grants to
you are revoked.

Nico
-- 



Re: How can we submit code patches that implement our (pending) patents?

2018-07-23 Thread Nico Williams
On Mon, Jul 23, 2018 at 11:40:41AM -0400, Bruce Momjian wrote:
> On Mon, Jul 23, 2018 at 08:19:35AM -0700, Andres Freund wrote:
> > I'm fairly sure that I'm right. But my point isn't that we should "trust
> > Andres implicitly ™" (although that's obviously not a bad starting point
> > ;)). But rather, given that that is a reasonable assumption that such
> > agreements are legally possible, we can decide whether we want to take
> > advantage of such terms *assuming they are legally sound*. Then, if, and
> > only if, we decide that that's interesting from a policy POV, we can
> > verify those assumptions with lawyers.
> 
> > 
> > Given we're far from the first project dealing with this, and that
> > companies that have shown themselves to be reasonably trustworthy around
> > open source, like Red Hat, assuming that such agreements are sound seems
> > quite reasonable.
> 
> Sun Microsystems seemed reasonably trustworthy too.

Are there patent grants from Sun that Oracle has attempted to renege on?
Are there court cases about that?  Links?

Nico
-- 



Re: How can we submit code patches that implement our (pending) patents?

2018-07-23 Thread Nico Williams
On Mon, Jul 23, 2018 at 11:55:01AM -0400, Bruce Momjian wrote:
> On Mon, Jul 23, 2018 at 11:40:41AM -0400, Bruce Momjian wrote:
> > On Mon, Jul 23, 2018 at 08:19:35AM -0700, Andres Freund wrote:
> > > I'm fairly sure that I'm right. But my point isn't that we should "trust
> > > Andres implicitly ™" (although that's obviously not a bad starting point
> > > ;)). But rather, given that that is a reasonable assumption that such
> > > agreements are legally possible, we can decide whether we want to take
> > > advantage of such terms *assuming they are legally sound*. Then, if, and
> > > only if, we decide that that's interesting from a policy POV, we can
> > > verify those assumptions with lawyers.
> > 
> > > 
> > > Given we're far from the first project dealing with this, and that
> > > companies that have shown themselves to be reasonably trustworthy around
> > > open source, like Red Hat, assuming that such agreements are sound seems
> > > quite reasonable.
> > 
> > Sun Microsystems seemed reasonably trustworthy too.
> 
> I realize what you are saying is that at the time Red Hat wrote that,
> they had good intentions, but they might not be able to control its
> behavior in a bankruptcy, so didn't mention it.  Also, Oracle is suing
> Google over the Java API:
> 
>   https://en.wikipedia.org/wiki/Oracle_America,_Inc._v._Google,_Inc.
> 
> which I can't imagine Sun doing, but legally Oracle can now that they
> own Java via Sun.  Of course, Sun might not have realized the problem,
> and Red Hat might have, but that's also assuming that there aren't other
> problems that Red Hat doesn't know about.

That's not about patents though, is it.

(I do believe that case is highly contrived.  Sun put Java under the
GPL, so presumably Google can fork it under those terms.  I've not
followed that case, so I don't really know what's up with it or why it
wasn't just dismissed with prejudice.)

Nico
-- 



Re: How can we submit code patches that implement our (pending) patents?

2018-07-23 Thread Nico Williams
On Mon, Jul 23, 2018 at 01:12:19PM -0400, Bruce Momjian wrote:
> On Mon, Jul 23, 2018 at 11:27:49AM -0500, Nico Williams wrote:
> > Perhaps patent law [in some countries] requires contracts as opposed to
> > licenses?
> 
> Yes, I really don't know.  I have just seen enough "oh, we didn't think
> of that" to be cautious.

So, is it FUD?  The core needs paid-for legal advice, not speculation.

I'm quite certain that a software license can make a patent grant to the
satisfaction of many open source communities, and almost certainly to
the satisfaction of the PG community.  But it will take an IP lawyer to
review or write such a license.

Nico
-- 



Re: How can we submit code patches that implement our (pending) patents?

2018-07-23 Thread Nico Williams
On Mon, Jul 23, 2018 at 01:12:49PM -0400, Bruce Momjian wrote:
> On Mon, Jul 23, 2018 at 11:37:05AM -0500, Nico Williams wrote:
> > On Mon, Jul 23, 2018 at 11:40:41AM -0400, Bruce Momjian wrote:
> > > Sun Microsystems seemed reasonably trustworthy too.
> > 
> > Are there patent grants from Sun that Oracle has attempted to renege on?
> > Are there court cases about that?  Links?
> 
> No, but I bet there are things Oracle is doing that no one at Sun
> expected to be done, and users who relied on Sun didn't expect to be
> done.

There are questions that the PG core needs help with and which IP
lawyers are needed to answer.  There are also business questions,
because sure, even if a patent owner makes an acceptable grant, how fast
and cheaply you could get a lawsuit by them dismissed on the basis of
that grant is a business consideration.

We, the non-lawyer PG community, can give input such as that which I've
contributed:

 - I won't read/modify source code involving patents whose grants are
   not as wide as X

 - the PG core needs advice from IP lawyers

 - patents placed in the public domain surely are safe for PG

 - there must be patent grant language acceptable to PG

Just merely "but they could do something bad!" from us non-lawyers is
not very good advice.  Already PG incurs the risk that its contributors
could act in bad faith.  For example, a contributor's employer might sue
PG under copyright and/or trade secrecy law claiming the contribution
was not authorized (this is why some open source projects require
contributor agreements).

Nico
-- 



Re: How can we submit code patches that implement our (pending) patents?

2018-07-23 Thread Nico Williams
On Mon, Jul 23, 2018 at 03:06:13PM -0400, Bruce Momjian wrote:
> On Mon, Jul 23, 2018 at 02:02:40PM -0500, Nico Williams wrote:
> > On Mon, Jul 23, 2018 at 01:12:19PM -0400, Bruce Momjian wrote:
> > > On Mon, Jul 23, 2018 at 11:27:49AM -0500, Nico Williams wrote:
> > > > Perhaps patent law [in some countries] requires contracts as opposed to
> > > > licenses?
> > > 
> > > Yes, I really don't know.  I have just seen enough "oh, we didn't think
> > > of that" to be cautious.
> > 
> > So, is it FUD?  The core needs paid-for legal advice, not speculation.
> > 
> > I'm quite certain that a software license can make a patent grant to the
> > satisfaction of many open source communities, and almost certainly to
> > the satisfaction of the PG community.  But it will take an IP lawyer to
> > review or write such a license.
> 
> And is the payback worth it?  Many don't think so.

Wouldn't that be something to decide on a case-by-case basis?

Recall, a contributor might have acquired a patent only for the purpose
of defensive use, and the patent might involve useful-to-PG ideas.

Moreover, if PG re-invents those ideas later it then risks exposure to
lawsuit anyways.  Arguably, accepting suitably-wide patent grants serves
to protect PG from this risk!  Is that FUD from me, or good advice?
You'll need a lawyer to decide.

The important thing is the legal language of the grant.  This is why PG
needs legal advice.

Nico
-- 



Re: How can we submit code patches that implement our (pending) patents?

2018-07-23 Thread Nico Williams
On Mon, Jul 23, 2018 at 12:12:03PM -0700, Joshua D. Drake wrote:
> On 07/23/2018 12:06 PM, Bruce Momjian wrote:
> >>So, is it FUD?  The core needs paid-for legal advice, not speculation.
> >>
> >>I'm quite certain that a software license can make a patent grant to the
> >>satisfaction of many open source communities, and almost certainly to
> >>the satisfaction of the PG community.  But it will take an IP lawyer to
> >>review or write such a license.
> >And is the payback worth it?  Many don't think so.
> 
> Although Nico is correct, I also think we need to consider what the
> community wants here. Historically, we have always explicitly avoided
> anything to do with patents to the point where some hackers won't even read
> white papers on patented methods. I do think there is a definite
> technological advantage for PostgreSQL if there was a license that core
> could accept that was patent friendly but frankly, I don't think that core
> or the community has the desire to work through the cost of doing so.

Absolutely.

I myself don't want to be tainted when reading PG source code (or even
docs).

I believe PG needs to demand at least a royalty-free, world-wide,
non-exclusive, transferrable license, either irrevocable (i.e., without
further conditions) or with a no-lawsuit covenant (i.e., revocable when
the patent owner gets sued by the patent user).  Such terms would be
acceptable to me as a third-party contributor.  Other contributors might
have a different take.  An IP lawyer could tighten that up and turn it
into legalese.

There might exist a license with suitable patent grant clauses that PG
could adopt for new contributions.  (Not GPLv3, of course, since that
one is viral, but still, it has patent grant language that could be
adapted for PG.)

I myself would not welcome patent grants that apply only to PG itself
and not to forks of it.  I would also not welcome grants that apply only
to PG and forks of it, but not new non-PG implementations of
the same patent (whether open source or not) -- i.e., don't taint me.

I suspect most other contributors too would not want to be tainted, so I
think that's probably the most critical requirement, and that does drive
towards the minimum grant breadth described above.

Some contributors may also oppose no-lawsuit covenants, but I believe
PG will benefit much more from allowing them than from disallowing them.

At any rate, PG absolutely should accept code involving patents that are
placed in the public domain provided that the owner places all relevant
patents (in the U.S., in the EU, and so on -- wherever they have them)
in the public domain.  Yes, I'm aware that Germany does not recognize
the idea of "public domain".  This too requires legal advice, which my
advice isn't.

Nico
-- 



Re: How can we submit code patches that implement our (pending) patents?

2018-07-24 Thread Nico Williams
On Tue, Jul 24, 2018 at 04:28:51PM +0100, Dave Page wrote:
> On Tue, Jul 24, 2018 at 4:26 PM, Tomas Vondra 
> wrote:
> > Clean room design addresses copyright-related issues, not patents.
> 
> Correct. It's important folks realise that!

Indeed.

It's also important to know, when reading PG source code or docs, that
there are no descriptions of patented algorithms nor references to
unexpired patents in PG code / docs, unless those have the widest
possible grant (as described before and below), are in the public
domain, or are expired.  Updating the PG license to make such patent
grants (thus committing contributors to making them too) would be the
best and simplest way to gain the reader's confidence that PG is not
tainting them.

Note that it's OK to *accidentally* implement patented algorithms as
long as the author of the contribution didn't know about them.  There are
no treble damages in that case, and no tainting of others, plus,
contributors and code reviewers/committers can't be expected to do
patent searches for each contribution.

IMO PG should have:

 - a license that grants patent rights as described before

   (royalty-free, non-exclusive, transferrable, irrevocable or revocable
   only where a relevant patent holder is sued, and then revocable only
   for the user doing the suing)

 - an employer contributor agreement whereby an individual contributor's
   employer's director agrees to allow PG-relevent contributions (by the
   contributor) to PG under the PG license (including patent grant
   text), and committing the contributor to disclose all non-expired
   patents embodied in the contribution

 - an individual contributor agreement whereby an individual contributor
   commits to following the PG code of conduct (which should include
   text about patents and copyrights), to make only PG-relevant
   contributions that the individual's employer (if they have one)
   permits, and to place the contributions under the PG license
   (including patent grant text), and so on

Nico
-- 



Re: How can we submit code patches that implement our (pending) patents?

2018-07-24 Thread Nico Williams
On Tue, Jul 24, 2018 at 06:29:37PM -0400, Isaac Morland wrote:
> On 24 July 2018 at 18:17, Nico Williams  wrote:
> > Note that it's OK to *accidentally* implement patented algorithms as
> > long as the author of the contribution didn't know about them.  There are
> > no treble damages in that case, and no tainting of others, plus,
> > contributors and code reviewers/committers can't be expected to do
> > patent searches for each contribution.
> 
> Non-lawyer here, but "OK" is not a description I would apply to
> implementing a patented algorithm. You might be thinking of copyright. Of
> course it is true that people can't reasonably be expected to do patent
> searches, as you describe, but the patent system as applied to software is
> not well-known among knowledgeable people for being reasonable.

Wrong.  With patents the important thing is not to know about them when
you implement -- if you come up with the same idea by accident (which,
of course, is obviously entirely possible) then you are not subject to
treble damages.  But if you knowingly copy the invention without a
license then you are subject to treble damages.

A lot of patented ideas are fairly obvious.  That always seems true
after the fact, naturally, but many are fairly obvious even before
knowing about them.  It's clearly possible that you'll infringe by
accident -- that's OK by comparison to infringing on purpose.

Nico
-- 



Re: How can we submit code patches that implement our (pending) patents?

2018-07-25 Thread Nico Williams
On Wed, Jul 25, 2018 at 03:06:22AM -0400, Chapman Flack wrote:
> On 07/25/18 01:56, Nico Williams wrote:
> 
> > Wrong.  With patents the important thing is not to know about them when
> > you implement -- if you come up with the same idea by accident (which,
> > of course, is obviously entirely possible) then you are not subject to
> > treble damages.
> 
> Even if the damages are not trebled, can 1✕ the damages be more than you
> would like to shell out? Not to mention the hassle of getting any infringing
> uses to cease?

You get a chance to stop infringing.  Also, how could you avoid
accidentally re-inventing patented algorithms if not by doing a costly
patent search every time you open $EDITOR??

> Also, is this distinction universally applied across jurisdictions?

It doesn't matter.  The point is that no one does a patent search every
time they design an algorithm -- not unless they mean to patent it.  So
you can't avoid accidentally re-inventing patented algorithms.



Re: How can we submit code patches that implement our (pending) patents?

2018-07-25 Thread Nico Williams
On Wed, Jul 25, 2018 at 02:48:01PM +0700, Benjamin Scherrey wrote:
> If you violate a patent, knowingly or otherwise, you are subject to
> penalties (perhaps not treble but still penalties) and will have to remove
> the offending code unless a deal is reached with the patent holder.

Unless you do a patent search every time you design something, you can't
avoid the risk of accidentally infringing.

> It is critical that Postgres require that all contributors do not enforce
> patents against Postgres - full stop. That's the IP agreement that should
> be in place.

My take is that the PG license should make royalty-free, non-exclusive,
transferrable, worldwide patent grants that are either irrevocable or
revocable only against those who sue the owner.  This together with a
contributor agreement would stop contributions by patent trolls, but it
still wouldn't prevent accidental re-inventions.

I don't understand why it's not obvious that one can unknowingly and
accidentally re-invent someone else's idea.



Re: Early WIP/PoC for inlining CTEs

2018-07-25 Thread Nico Williams
On Wed, Jul 25, 2018 at 07:42:37AM +0200, David Fetter wrote:
> Please find attached a version rebased atop 167075be3ab1547e18 with
> what I believe are appropriate changes to regression test output.  The
> other changes to the regression tests output are somewhat puzzling, as
> they change the actual results of queries.  I've also attached both
> the "leftover" diff and the files to which it should be applied.

I think the SQL programmer needs some control over whether a CTE is:

 - a materialized view -- and therefore a barrier
 - a view (which can then be inlined by the optimizer)

It is possible to add a keyword for this purpose in the WITH syntax:

WITH   VIEW (...) AS a_view
 , MATERIALIZED VIEW (...) AS a_barrier
...;

This would be a lot like creating TEMP views, but without the catalog
overhead.

(I wonder how hard it would be to partition the OID namespace into
temp/persistent ranges so that temp schema elements need not be written
into the catalog.)

Nico
-- 



Re: How can we submit code patches that implement our (pending) patents?

2018-07-25 Thread Nico Williams
On Wed, Jul 25, 2018 at 11:45:58AM -0400, Chapman Flack wrote:
> On 07/25/2018 11:25 AM, Nico Williams wrote:
> 
> > I don't understand why it's not obvious that one can unknowingly and
> > accidentally re-invent someone else's idea.
> 
> It's perfectly obvious. It's the chief reason the whole topic
> of software patents has been deeply controversial for so long.

Thanks.

> You seem to be using it as part of a proof by contradiction:

wat

>   One can unknowingly and accidentally reinvent
>   a patented idea.
> 
>   If that were not tidily excused in practice,
>   software patents would be deeply problematic.
>   --
> 
>   Therefore, it must be the case that unknowing and
>   accidental reinvention is tidily excused in practice.
> 
> I don't think it works.

I didn't say it's excused.  The damages that can be awarded are just
three times less.  In practice it is common to settle for no longer
infringing.

What are you proposing anyways?  That every commit come with a patent
search?

Nico
-- 



Re: Early WIP/PoC for inlining CTEs

2018-07-25 Thread Nico Williams
On Wed, Jul 25, 2018 at 05:08:43PM +0100, Andrew Gierth wrote:
>  Nico> It is possible to add a keyword for this purpose in the WITH syntax:
> 
>  Nico> WITH   VIEW (...) AS a_view
> 
> The existing (and standard) syntax is WITH ctename AS (query).

Oy, I flubbed that up.

> Syntaxes that have been suggested for explicitly controlling the
> materialization are along the lines of:
> 
> WITH ctename AS [[NOT] MATERIALIZED] (query)
> 
> (MATERIALIZED is already an un-reserved keyword)

Works for me.
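
For example, with that suggested syntax one could write (sketch only;
table names hypothetical, and the syntax itself not yet in PG):

    WITH w AS NOT MATERIALIZED (   -- inlined; the planner may push quals in
             SELECT id, x FROM big_table
         ),
         b AS MATERIALIZED (       -- computed once; an optimizer barrier
             SELECT id, y FROM other_table
         )
    SELECT w.x, b.y
      FROM w JOIN b USING (id)
     WHERE w.x = 42;

The NOT MATERIALIZED CTE would behave like a sub-select in FROM, while
the MATERIALIZED one would keep today's barrier behavior.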



Re: Early WIP/PoC for inlining CTEs

2018-07-25 Thread Nico Williams
On Tue, Jul 24, 2018 at 07:57:49PM -0400, Tom Lane wrote:
> Andres Freund  writes:
> > On 2018-07-24 19:49:19 -0400, Tom Lane wrote:
> >> However, a singly-referenced SELECT CTE could reasonably be treated as
> >> equivalent to a sub-select-in-FROM, and then you would have the same
> >> mechanisms for preventing inlining as you do for those cases,
> >> e.g. OFFSET 0.  And sticking in OFFSET 0 would be backwards-compatible
> >> too: your code would still work the same in older releases, unlike if
> >> we invent new syntax for this.
> 
> > I still think this is just doubling down on prior mistakes.
> 
> Not following what you think a better alternative is?  I'd be the
> first to agree that OFFSET 0 is a hack, but people are used to it.
> 
> Assuming that we go for inline-by-default for at least some cases,
> there's a separate discussion to be had about whether it's worth
> making a planner-control GUC to force the old behavior.  I'm not
> very excited about that, but I bet some people will be.

It is widely known that CTEs in PG are optimizer barriers.

That actually is useful, and I do make use of that fact (though I'm not
proud of it).

My proposal is that PG add an extension for specifying that a CTE is to
be materialized (barrier) or not (then inlined).

Nico
-- 



Re: How can we submit code patches that implement our (pending) patents?

2018-07-26 Thread Nico Williams
On Thu, Jul 26, 2018 at 04:42:24PM -0400, Tom Lane wrote:
> It's barely possible that we could get current and new contributors to
> sign some kind of CLA containing anti-patent terms, but I don't think
> there's any hope of amending the distribution license.

Leaving aside whether you could change the license, or apply a new
license only to new contributions (that might be hard unless you're
making the history part of the product), ...

...if a CLA is too difficult too, then perhaps demand disclosure by
authors of any knowledge they might have of patents that their
contribution implements.

Nico
-- 



Re: How can we submit code patches that implement our (pending) patents?

2018-07-27 Thread Nico Williams
On Fri, Jul 27, 2018 at 09:30:45AM -0400, Robert Haas wrote:
> If you think that the lack of a CLA and a patent grant never causes
> extensive conversations with legal, I am quite certain that you are
> incorrect.  I know of multiple instances where this has been a
> concern.
> 
> Other open source projects even more prominent and successful than
> PostgreSQL have done these things.  [...]

FWIW, some have done it poorly.  I had trouble getting an employer to
sign the Oracle contributor agreement, for example, because the OCA did
not allow the employer to specify who could make the contribution (only
who the director is who signs the OCA) and also did not allow the
employer to narrow the scope of contributions to those relevant to the
OpenJDK.

I also recommend adding to the PG code of conduct text requiring
disclosure of relevant patent rights that the contributor knows about.

Nico
-- 



Re: How can we submit code patches that implement our (pending) patents?

2018-07-27 Thread Nico Williams
Even assuming you can't change the PG license, you could still:

 - require disclosure in contributions
 - require a wide grant in contributions
 - document all such grants separately from the copyright license

Putting the grants in the license is convenient, but it's not required
to include patent language in order to get the desired effect.  The
license is just a copyright license -- it could stay just that.

You could also require a CLA in the case where disclosure is made, or
even in all cases.  But a CLA for all cases would be a bit too
burdensome.

Nico
-- 



Re: How can we submit code patches that implement our (pending) patents?

2018-07-27 Thread Nico Williams
On Fri, Jul 27, 2018 at 10:01:40AM -0700, Andres Freund wrote:
> On 2018-07-27 11:15:00 -0500, Nico Williams wrote:
> > Even assuming you can't change the PG license, you could still:
> > 
> >  - require disclosure in contributions
> 
> That really has no upsides, except poison the area.  [...]

Sure it does: a) it allows PG to reject such contributions unless they
come with suitable grants, and b) it gives PG users a legal defense
should they ever be sued for infringement on a patent that was not
disclosed to PG at the time of the contribution.  (b) is a big deal.

Nico
-- 



Re: [PATCH v18] GSSAPI encryption support

2018-08-06 Thread Nico Williams
On Mon, Aug 06, 2018 at 10:36:34AM -0400, Stephen Frost wrote:
> * Heikki Linnakangas (hlinn...@iki.fi) wrote:
> > Sorry if this sounds facetious, but:
> > 
> > What is the point of this patch? What's the advantage of GSSAPI encryption
> > over SSL? I was hoping to find the answer by reading the documentation
> > changes, but all I can see is "how" to set it up, and nothing about "why".
> 
> If you've already got an existing Kerberos environment, then it's a lot
> nicer to leverage that rather than having to also implement a full PKI
> to support and use SSL-based encryption.
> 
> There's also something to be said for having alternatives to OpenSSL.

Those two reasons would be my motivation if I were implementing this,
and they are some of the reasons I did a code review.

Nico
-- 



Re: Have an encrypted pgpass file

2018-08-08 Thread Nico Williams
On Tue, Jul 24, 2018 at 12:25:31PM +0200, Marco van Eck wrote:
> Indeed having unencrypted password lying (.pgpass or PGPASSWORD or -W)
> around is making my auditors unhappy, and forcing me to enter the password
> over and over again. With a simple test it seems the password entered by
> the user also stays in memory, since it is able to reset a broken
> connection. Finding the password in memory is not trivial, but prevention
> is always preferred.

Sometimes the auditors are just wrong.  Say you're using Kerberos, so
you put service keys in... "keytab files" -- that's just like putting a
password in a file because they are equivalent.  Or server certificates
and their private keys... -- where are you going to put them if not in
some file?

Sure, you could put keys in a TPM (except they are orders of magnitude
too slow) or some other hardware token that doesn't suck (except those
are expensive).  But now you still need unattended access to the token,
and who cares what the keys _are_ when you can just _use_ them to
escalate privilege anyways??

Forcing attended operation of otherwise automatic systems is not a good
idea, and it is very expensive too.

A quick search turns up tools for finding cryptographic keys in memory.
Passwords can't be much harder.

> It might be an idea to wipe the password after the login, and decrypt/read
> it again if it needs to reconnect. Would this make the solution more
> secure? I had a quick look at the code and the patch would stay compact.
> Please let me know of doing this would make sense.

But you still wanted to automate things, no?

You can't protect from local hosts compromises.  You have to protect the
host.  Please, no security theater.  Do you think the day after a
massive data breach at your company you can tell the press "gee, I dunno
how they got the password, it was only loaded in memory!" and save it
from liability?  (If there's no liability either way, that's a different
problem.)

Nico
-- 



Re: [HACKERS] possible self-deadlock window after bad ProcessStartupPacket

2018-08-08 Thread Nico Williams
On Wed, Aug 08, 2018 at 07:19:34PM +0300, Heikki Linnakangas wrote:
> On 20/07/18 18:03, Andres Freund wrote:
> >It's much less the exit() that's unsafe, than the callbacks themselves,
> >right?  Perhaps just restate that we wouldn't want to trigger atexit
> >processing due to signal safety?
> 
> Well, in principle exit() is unsafe too, although you're right that in
> practice it's more likely the callbacks that would cause trouble. I reworded
> the comment to put more emphasis on the callbacks.

It's unsafe because it will:

 - flush stdio buffers
 - call atexit handlers (which might, e.g., free())
 - who knows what else

It's easy enough to decide to exit() or _exit().  If you can't tell
which to call from lexical context, then use a global volatile
sig_atomic_t variable to indicate that we're exiting from a signal
handler and check that variable in the quickdie() functions.

> So, with this commit, the various SIGQUIT quickdie handlers are in a better
> shape. The regular backend's quickdie() handler still calls ereport(), which
> is not safe, but it's a step in the right direction. And we haven't
> addressed the original complaint, which was actually about startup_die(),
> not quickdie().

Yes.  If only snprintf() and vsnprintf() were async-signal-safe --
they can be, and some implementations probably are, but they aren't
required to be -- making ereport() safe would be easier.

Nico
-- 



Re: [HACKERS] possible self-deadlock window after bad ProcessStartupPacket

2018-08-08 Thread Nico Williams
On Wed, Aug 08, 2018 at 11:47:34AM -0500, Nico Williams wrote:
> Yes.  If only snprintf() and vsnprintf() were async-signal-safe --
> they can be, and some implementations probably are, but they aren't
> required to be -- making ereport() safe would be easier.

So, I took a look at glibc's implementation for giggles.  It uses
malloc() (and free()) only in three cases: a) when printing floating
point numbers with very large/small exponents, b) when formatting long
wide strings into multi-byte strings, c) when formatting specifiers have
width asterisks.

Assuming locales are not lazily loaded, I think that's probably all the
reasonable cases where vsnprintf() could call async-signal-unsafe
functions or do async-signal-unsafe things.

Considering that PostgreSQL already has async-signal-unsafe signal
handlers, might as well assume that vsnprintf() is async-signal-safe and
just format strings into alloca()'ed buffers or even into fixed-sized
automatic char arrays, and be done.  Perhaps when a global volatile
sig_atomic_t is set denoting we're handling a signal, then use
vsnprintf() with a fixed-sized automatic buffer, otherwise malloc() one.
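
A sketch of that (hypothetical names; assumes vsnprintf() is
async-signal-safe, per the reasoning above):

    #include <signal.h>
    #include <stdarg.h>
    #include <stdio.h>
    #include <unistd.h>

    static volatile sig_atomic_t handling_signal = 0;

    static void
    report(const char *fmt, ...)
    {
        va_list ap;

        va_start(ap, fmt);
        if (handling_signal) {
            /* fixed-size automatic buffer: no malloc() on the signal path */
            char buf[1024];
            int  n = vsnprintf(buf, sizeof(buf), fmt, ap);

            if (n > 0)
                (void) write(STDERR_FILENO, buf,
                             (size_t) n < sizeof(buf) ? (size_t) n
                                                      : sizeof(buf) - 1);
        } else {
            (void) vfprintf(stderr, fmt, ap);   /* normal path may use stdio */
        }
        va_end(ap);
    }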

Nico
-- 



Re: [FEATURE REQUEST] Encrypted indexes over encrypted data

2018-08-09 Thread Nico Williams
On Thu, Aug 09, 2018 at 03:00:26PM +0300, Danylo Hlynskyi wrote:
> The problem
> ==
> 
> [...]
> 
> We don't trust full-disk-encryption or any other transparent encryption,
> because of possible SQL injections.

Can you elaborate on this?

> Solution 1 (possibly can be used even now)
> 
> - perform full-disk encryption
> - perform encryption of column
> - add decrypting expression-based index with decryption key
> - limit ways on disclosing index internals. Ideally if no one except admin
> can do that
> - limit ways to read index definitions - so it's not possible for
> application to uncover decryption key from database itself, it should know
> it on it's own.

But... you have to have the decryption key(s) in memory at all times to
enable any write operations.  And plaintext as well at various times.

What does this gain you that FDE doesn't?

> Solution 2 (feature request)
> 
> - full-disk encryption is optional
> - data column is encrypted
> - index is decrypted by construction, but each it's block is encrypted,
> even in memory.
> - lookups over index do lazy index buffer decrypt and close buffers ASAP
> - make every query that has to touch encrypted column or encrypted index
> require decryption key. This means, SELECT, DELETE, UPDATE, INSERT, VACUUM,
> CLUSTER, CREATE INDEX, pg_dump, pg_restore all should have decryption key
> supplied in order to be executed. This also means, that autovacuum daemon
> can't work.

Same response.

> What do you think about both solutions? Is it hard to implement soluition 2?

They gain little or nothing over doing filesystem encryption in the OS
or even just plain FDE (see below).  They are not worthwhile.

You need to define your threat model.  What bad actors are you
protecting against?  What threats are you protecting against?

https://www.postgresql.org/message-id/20180622042337.GL4200%40localhost

Here are some threats you might choose to protect against:

1) passive attackers on the wire
2) active  attackers on the wire
3a) theft / compromise of storage devices
3b) compromise of decommissioned storage devices
3c) theft of running server
4) compromised backup storage
5) bad / compromised clients
6) bad / compromised DBAs or sysadmins
7) side channel exploits
8) ??

(1) and (2) are taken care of by TLS.

(3a) is taken care of by FDE in controllers, say, or by physical
security.

(3b) is taken care of by proper decommissioning, but FDE helps.

(3c) you can't protect against if you have keys in memory.  You could
use client-side crypto, but you'll have more clients to worry about than
servers.  Physical security is your best option.  (And really, you don't
get any way to protect against law enforcement taking the devices.)

(4) is taken care of by encrypting backups, which requires no changes to
PG to get.

(5) is taken care of (to some degree) by server-side logic (triggers,
...).

(6)...  You can't protect against sysadmins, really, nor DBAs, but you
can use crypto on the *client*-side to get some protection.  Since the
PG client is very thin and dumb, the PG client can't easily do this.
The idea is to encrypt values and MAC/sign rows to prevent DBAs/
sysadmins seeing sensitive data or tampering with your data.

(7) one deals with by using crypto implementations built with side
channel protection, though, really, this is a very difficult subject in
general, especially since Spectre.

Nico
-- 



Re: POC for a function trust mechanism

2018-08-09 Thread Nico Williams
On Wed, Aug 08, 2018 at 01:15:38PM -0400, Tom Lane wrote:
> This is sort of a counter-proposal to Noah's discussion of search path
> security checking in <20180805080441.gh1688...@rfd.leadboat.com>.
> (There's no technical reason we couldn't do both things, but I think
> this'd be more useful to most people.)

So, this is why I always fully-qualify all references to functions,
tables, etc.  I also always set a search_path on each function just in
case I accidentally leave a non-fully-qualified symbol.
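For illustration, that practice looks like this (a sketch; the schema
and table names are made up):

  CREATE FUNCTION myschema.order_total(_order_id bigint)
  RETURNS numeric
  LANGUAGE sql STABLE
  SET search_path = myschema, pg_temp  -- pin resolution; pg_temp last
  AS $$
    SELECT sum(li.amount)
    FROM myschema.line_items li        -- fully-qualified regardless
    WHERE li.order_id = _order_id;
  $$;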

I would like to have a way to request that all non-fully-qualified
symbols be resolved at function create/replace time and that the
resolution results be made permanent for the function.  If I have
several schemas in a search_path at function definition time, this would
not allow me to move dependencies around without replacing the
dependents -- that's OK for me.

Nico
-- 



Re: [FEATURE REQUEST] Encrypted indexes over encrypted data

2018-08-09 Thread Nico Williams
On Thu, Aug 09, 2018 at 02:34:07PM -0600, Bear Giles wrote:
> Some regulatory standards require all UII, even all PII, information be
> encrypted within the database, not just on encrypted media. That's to
> reduce exposure even if someone gets access to a live server, e.g., via SQL
> Injection. (The perennial #1 risk for software vulnerabilities.)

My preference for dealing with SQL Injection is to not provide direct
SQL access, but to use PostgREST exporting a schema that has only PG SQL
functions encapsulating all supported queries.  You just can't have
injection with such an access layer because you don't get to send SQL to
the server (because you don't get to send SQL to PostgREST).  It really
helps that PostgREST is written in Haskell.
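For example, the exported schema can consist entirely of functions like
this (names invented for illustration):

  CREATE SCHEMA api;

  -- Clients may only call exported functions; they never get to send
  -- SQL text, so there is no statement to inject into.
  CREATE FUNCTION api.get_order(_order_id bigint)
  RETURNS SETOF app.orders
  LANGUAGE sql STABLE SECURITY DEFINER
  SET search_path = app, pg_temp
  AS $$
    SELECT * FROM app.orders WHERE id = _order_id;
  $$;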

That said, sure, if you have SQL Injection issues, then encrypting in
the database will do provided that there's no transparent way to access
the data (otherwise you've gained nothing).  That basically means you're
doing all the crypto on the client.

If you're doing all the crypto on the client, then your options for
indexing are very limited indeed.  To avoid offline dictionary attacks
you have to index MAC'ed values, effectively.  You can still do free
text indexing, provided you MAC each word.  MAC == message
authentication code, really, it's a keyed hash function, typically HMAC,
UMAC, or some such.  You could also index ciphertext, provided it has an
authentication tag, but you don't gain anything versus just indexing the
authentication tag.
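The shape of that is roughly this (a sketch; it mentions pgcrypto's
hmac() only to illustrate the tag computation -- in the client-side-
crypto case the client computes tags itself and the server never holds
the MAC key):

  CREATE TABLE secrets (
    id    bigserial PRIMARY KEY,
    ctext bytea NOT NULL,   -- client-encrypted payload
    tag   bytea NOT NULL    -- e.g. hmac(plaintext, mac_key, 'sha256')
  );

  -- Index the MACs, not the plaintext: equality lookups still work,
  -- but offline dictionary attacks don't without the MAC key.
  CREATE INDEX ON secrets (tag);

  -- The client computes the tag for a sought value locally, then:
  --   SELECT id, ctext FROM secrets WHERE tag = $1;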

> I know the government required UII encryption in its databases when I last
> [...]

Usually regulations are not quite as prescriptive as that, though
there's always a discussion to be had with the regulators/auditors when
you deviate from the norm.  You're generally not precluded from having
better solutions than is the norm.

Nico
-- 



Re: NetBSD vs libxml2

2018-08-11 Thread Nico Williams
On Sat, Aug 11, 2018 at 01:18:26PM -0400, Tom Lane wrote:
> In a moment of idle curiosity, I tried to build PG --with-libxml
> on NetBSD-current (well, mostly current, from May or so).
> The configure script blew up, complaining that it couldn't execute
> a test program.  Investigation showed that xml2-config reports this:
> 
> $ xml2-config --libs
> -Wl,-R/usr/pkg/lib -L/usr/pkg/lib -lxml2 -L/usr/lib -lz -L/usr/lib -llzma 
> -L/usr/lib -lm
> 
> and we're only paying attention to the -L switches out of that.
> So we successfully link to /usr/pkg/lib/libxml2.so, but then
> execution fails for lack of rpath pointing at /usr/pkg/lib.
> 
> We could fix this by teaching configure to absorb -Wl,-R... switches
> into LDFLAGS from xml2-config's output, and that seems to make things
> work, but I wonder whether we should or not.  This seems like a new height
> of unfriendliness to non-default packages on NetBSD's part, and it's a bit
> hard to believe the behavior will make it to a formal release.  I don't
> see any comparable behavior on FreeBSD for instance --- it puts packages'
> libraries into /usr/local/lib, but that seems to be searched automatically
> without additional switches beyond -L.  Don't have an easy way to check
> things on OpenBSD.
> 
> Thoughts?

-Wl,-R (and friends, like -Wl,-rpath) is part and parcel of dynamic
linking, and are precious.  All software should honor these.

That non-default packages don't end up in /usr is a perfectly legitimate
thing for a distribution to do.

More importantly, a site-build that uses a non-system location for e.g.
locally-patched open source packages, is a perfectly legitimate thing
for _users_ to do.  It isn't nice to force them to hack a project's
./configure file or work out precious envvar settings to make that
project support non-system locations for dependencies.

I guess the problem here is that xml2-config is (like so many *-config
programs) broken by not having a way to get ld flags and libs
separately...  Which would be why ./configure is parsing the output of
xml2-config --libs.

The better thing to do would be to take all the words from xml2-config
--libs that match -l* and put them in LIBS while all others go into
LDFLAGS.  It's safer to assume that -l* means "link this in" than that
there won't be new linker options other than -L or -l.

I'll note too that -lxml2 is hardcoded in ./configure.in.  That's not
right either.

IMO this is just a minor bug in PG.

I'm willing to write a patch after lunch.  In ./configure.in this:

for pgac_option in `$XML2_CONFIG --libs`; do
  case $pgac_option in
-L*) LDFLAGS="$LDFLAGS $pgac_option";;
  esac
done

should change to:

for pgac_option in `$XML2_CONFIG --libs`; do
  case $pgac_option in
-l*) LDLIBS="$LDLIBS $pgac_option";;
*) LDFLAGS="$LDFLAGS $pgac_option";;
  esac
done

More changes are needed to stop hard-coding -lxml2.

I can write a patch if you like.

Nico
-- 



Re: Facility for detecting insecure object naming

2018-08-11 Thread Nico Williams
On Sat, Aug 11, 2018 at 12:47:05PM -0700, Noah Misch wrote:
> -- (3) "SET search_path" with today's code.
> --
> -- Security and reliability considerations are the same as (2).  Today, this
> -- reduces performance by suppressing optimizations like inlining.

Out of curiosity, why does this suppress inlining?

Anyways, my preference would be to have syntax by which to say: resolve
at declaration time using the then-in-effect search_path and store
as-qualified.  This could just be SET search_path without an assignment.

  CREATE FUNCTION ... AS $$ ... $$ SET search_path;

Another possibility would be to have a way to set a search_path for all
expressions in a given schema, something like:

  SET SCHEMA my_schema DEFAULT search_path = ...;

which would apply to all expressions in schema elements in schema
"my_schema":

 - CHECK expressions
 - INDEX expressions
 - VIEWs and MATERIALIZED VIEWs
 - FUNCTION and STORED PROCEDURE bodies
 - ...

  CREATE SCHEMA IF NOT EXISTS my_schema;

  SET SCHEMA my_schema DEFAULT search_path = my_schema, my_other_schema;

  CREATE OR REPLACE FUNCTION foo() ... AS $$ ... $$;

  ...

Nico
-- 



Re: Facility for detecting insecure object naming

2018-08-11 Thread Nico Williams
On Wed, Aug 08, 2018 at 10:47:04AM -0400, Tom Lane wrote:
> Isaac Morland  writes:
> > While I'm asking, does anybody know why this isn't the default, especially
> > for SECURITY DEFINER functions?
> 
> It might fix some subset of security issues, but I think that changing
> the default behavior like that would break a bunch of other use-cases.
> It'd be especially surprising for such a thing to apply only to
> SECURITY DEFINER functions.

Some projects consider breaking backwards compatibility to fix security
problems (within reason, and with discussion) to be a fair thing to do.

Already people have to qualify their apps for every release of PG.  I
think this problem very much deserves a good solution.

Nico
-- 



Re: NetBSD vs libxml2

2018-08-11 Thread Nico Williams
On Sat, Aug 11, 2018 at 01:12:09PM -0500, Nico Williams wrote:
> I'm willing to write a patch after lunch.  In ./configure.in this:
> 
> for pgac_option in `$XML2_CONFIG --libs`; do
>   case $pgac_option in
> -L*) LDFLAGS="$LDFLAGS $pgac_option";;
>   esac
> done
> 
> should change to:
> 
> for pgac_option in `$XML2_CONFIG --libs`; do
>   case $pgac_option in
> -l*) LDLIBS="$LDLIBS $pgac_option";;
> *) LDFLAGS="$LDFLAGS $pgac_option";;
>   esac
> done
> 
> More changes are needed to stop hard-coding -lxml2.

Actually, no, no more changes are needed.  The -lxml2 comes from:

1193 if test "$with_libxml" = yes ; then
1194   AC_CHECK_LIB(xml2, xmlSaveToBuffer, [], [AC_MSG_ERROR([library 'xml2' (version >= 2.6.23) is required for XML support])])
1195 fi

in configure.in, and I think contrib/xml2/Makefile needs to hardcode it.

So the above quoted change is all that is needed, plus re-run autoconf.

See attached.  (I'm not including unrelated changes made to configure by
autoconf.)

Nico
-- 
>From 995ee1dbbc0ee9af7bbb24728f09ec022696bba0 Mon Sep 17 00:00:00 2001
From: Nicolas Williams 
Date: Sat, 11 Aug 2018 15:52:19 -0500
Subject: [PATCH] Keep rpath et. al. from xml2-config --libs

Instead of looking for -L* words to put into LDFLAGS and everything else
into LIBS, look for -l* words to put into LIBS and everything else into
LDFLAGS.
---
 configure| 3 ++-
 configure.in | 3 ++-
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/configure b/configure
index 2665213..210f739 100755
--- a/configure
+++ b/configure
@@ -8009,7 +8009,8 @@ fi
 done
 for pgac_option in `$XML2_CONFIG --libs`; do
   case $pgac_option in
--L*) LDFLAGS="$LDFLAGS $pgac_option";;
+-l*) LIBS="$LIBS $pgac_option";;
+*) LDFLAGS="$LDFLAGS $pgac_option";;
   esac
 done
   fi
diff --git a/configure.in b/configure.in
index 397f6bc..b3f5c6c 100644
--- a/configure.in
+++ b/configure.in
@@ -902,7 +902,8 @@ if test "$with_libxml" = yes ; then
 done
 for pgac_option in `$XML2_CONFIG --libs`; do
   case $pgac_option in
--L*) LDFLAGS="$LDFLAGS $pgac_option";;
+-l*) LIBS="$LIBS $pgac_option";;
+*) LDFLAGS="$LDFLAGS $pgac_option";;
   esac
 done
   fi
-- 
2.7.4



Re: NetBSD vs libxml2

2018-08-13 Thread Nico Williams
[Quoting out of order.]

On Mon, Aug 13, 2018 at 11:16:23AM -0400, Tom Lane wrote:
> Robert Haas  writes:
>
> I'm not, personally, eager to do that work for a requirement which
> somehow hasn't surfaced on any other platform, nor on any previous
> NetBSD release.  I think NetBSD is way out in left field here.

It's not just about the distros.  It's also about sites that build and
install alternate versions of things that PG might depend on.  I've seen
that many times.  It is PG that is being hostile to them, not NetBSD
that is being hostile to PG.

Building PG outside a distro, with libxml2 outside a distro, ought to be
possible, but currently isn't without hacking on PG.  That's not good.

In any case, the patch I posted is trivial and small and should do the
job.

> > I kind of agree with Nico: why do we think we get to tell operating
> > system distributions which switches they're allowed to need to make
> > things work?  The point of things like pg_config and xmlconfig is to
> > reveal what is needed.  If we editorialize on that, we do so at our
> > own risk.
> 
> Well, the issue is that new kinds of switches introduce new potential
> for bugs.  In the case of -Wl,-R..., I'm not even sure that you can
> write that more than once per link, so absorbing one from xml2-config
> might well break things on some platforms.  Or, if you can write more
> than one, we'd need to make sure they end up in a desirable order.
> (cf commit dddfc4cb2 which fixed similar issues for -L switches;
> it looks to me like the current coding would in fact fail to order
> our $(rpath) correctly against one absorbed from xml2, and it would
> not be trivial to fix that.)

I believe no new options have _ever_ been added to linker-editors for
specifying dependencies, but lots of options for other purposes have
been added over the last several decades.

The link-editors are free to add new linker options.  There is nothing
we can do about that.

The fundamental problem here is that *-config programs should have had
separate --libs and --ldflags options, but don't, obligating us (and
others too) to parse the output of the --libs option.

What we need is an approach to parsing the output of the --libs option
that is likely to be more stable.  I believe looking for words that
start with -l is better than looking for words that start with -L,
because there haven't been new ways to specify dependencies ever, but
there have been tons of new linker-editor options for other things.  It
follows that this approach should be more stable.

And yes, I understand that if the linker-editors ever add new options
for specifying dependencies then we'll be here again.  It just seems a)
unlikely, b) not that burdensome if it does ever happen.

Aside:  It is truly shocking that in a world where, for better or worse,
autoconf is so prevalent, *-config programs still fail to
separate --libs and --ldflags.  After all, the separation of
LIBS (not precious) and LDFLAGS (precious), is very much a
tradition in autoconf'ed projects, and autoconf very much
supports it.

Nico
-- 



Re: [HACKERS] [PATCH] WIP Add ALWAYS DEFERRED option for constraints

2018-06-06 Thread Nico Williams
[Re-send; first attempt appears to have hit /dev/null somewhere.  My
apologies if you get two copies.]

I've finally gotten around to rebasing this patch and making the change
that was requested, which was: merge the now-would-be-three deferral-
related bool columns in various pg_catalog tables into one char column.

Instead of (deferrable, initdeferred, alwaysdeferred), now there is just
(deferral).

All tests (make check) pass.

Sorry for the delay in doing this!

Incidentally, I had to do commit-by-commit rebasing to make the rebase
easier.  I have a shell function I use for this, if anyone wants a copy
of it -- sometimes it's much easier to do this than to do one huge jump.

Nico
-- 
>From 20323d6f19601f5471eb4db7570af8d3342b627d Mon Sep 17 00:00:00 2001
From: Nicolas Williams 
Date: Tue, 3 Oct 2017 00:33:09 -0500
Subject: [PATCH] Add ALWAYS DEFERRED option for CONSTRAINTs

and CONSTRAINT TRIGGERs.

This is important so that one can have triggers and constraints that
must run after all of the user/client's statements in a transaction
(i.e., at COMMIT time), so that the user/client may make no further
changes (triggers, of course, still can).
---
 doc/src/sgml/catalogs.sgml |  39 -
 doc/src/sgml/ref/alter_table.sgml  |   4 +-
 doc/src/sgml/ref/create_table.sgml |  10 ++-
 doc/src/sgml/ref/create_trigger.sgml   |   2 +-
 doc/src/sgml/trigger.sgml  |   3 +-
 src/backend/bootstrap/bootparse.y  |   6 +-
 src/backend/catalog/heap.c |   3 +-
 src/backend/catalog/index.c|  27 +++---
 src/backend/catalog/information_schema.sql |  12 ++-
 src/backend/catalog/pg_constraint.c|  14 ++-
 src/backend/commands/indexcmds.c   |   6 +-
 src/backend/commands/tablecmds.c   |  59 +
 src/backend/commands/trigger.c |  42 +
 src/backend/commands/typecmds.c|   5 +-
 src/backend/nodes/copyfuncs.c  |   9 +-
 src/backend/nodes/equalfuncs.c |   9 +-
 src/backend/nodes/outfuncs.c   |  10 ++-
 src/backend/parser/gram.y  | 136 +++--
 src/backend/parser/parse_utilcmd.c |  82 +++--
 src/backend/utils/adt/ruleutils.c  |  14 +--
 src/bin/pg_dump/pg_dump.c  |  66 +++---
 src/bin/pg_dump/pg_dump.h  |   8 +-
 src/bin/psql/describe.c|  51 +--
 src/bin/psql/tab-complete.c|   4 +-
 src/include/catalog/index.h|   5 +-
 src/include/catalog/pg_constraint.h|   6 +-
 src/include/catalog/pg_trigger.h   |   7 +-
 src/include/commands/trigger.h |   1 +
 src/include/nodes/parsenodes.h |  12 ++-
 src/include/utils/reltrigger.h |   3 +-
 src/test/regress/expected/alter_table.out  |  60 ++---
 src/test/regress/input/constraints.source  |  51 +++
 src/test/regress/output/constraints.source |  50 +++
 src/test/regress/sql/alter_table.sql   |   4 +-
 34 files changed, 487 insertions(+), 333 deletions(-)

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 3ed9021..e82e39b 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -2239,17 +2239,15 @@ SCRAM-SHA-256$:&l
  
 
  
-  condeferrable
-  bool
-  
-  Is the constraint deferrable?
- 
-
- 
-  condeferred
-  bool
+  condeferral
+  char
   
-  Is the constraint deferred by default?
+  Constraint deferral option:
+a = always deferred,
+d = deferrable,
+i = deferrable initially deferred,
+n = not deferrable
+  
  
 
  
@@ -7044,17 +7042,16 @@ SCRAM-SHA-256$:&l
  
 
  
-  tgdeferrable
-  bool
-  
-  True if constraint trigger is deferrable
- 
-
- 
-  tginitdeferred
-  bool
+  tgdeferral
+  char
   
-  True if constraint trigger is initially deferred
+  
+   tgdeferral is
+   a = always deferred,
+   d = deferrable,
+   i = deferrable initially deferred,
+   n = not deferrable.
+  
  
 
  
@@ -7119,7 +7116,7 @@ SCRAM-SHA-256$:&l

 When tgconstraint is nonzero,
 tgconstrrelid, tgconstrindid,
-tgdeferrable, and tginitdeferred are
+and tgdeferral are
 largely redundant with the referenced pg_constraint entry.
 However, it is possible for a non-deferrable trigger to be associated
 with a deferrable constraint: foreign key constraints can have some
diff --git a/doc/src/sgml/ref/alter_table.sgml b/doc/src/sgml/ref/alter_table.sgml
index 1cce00e..fd28a0d 100644
--- a/doc/src/sgml/ref/alter_table.sgml
+++ b/doc/src/sgml/ref/alter_table.sgml
@@ -55,7 +55,7 @@ ALTER TABLE [ IF EXISTS ] name
 ALTER [ COLUMN ] column_name SET STORAGE { PLAIN | EXTERNAL 

Re: [PATCH v16] GSSAPI encryption support

2018-06-07 Thread Nico Williams
On Tue, Jun 05, 2018 at 12:16:31PM +1200, Thomas Munro wrote:
> On Sat, May 26, 2018 at 6:58 AM, Robbie Harwood  wrote:
> > Me and the bot are having an argument.  This should green Linux but I
> > dunno about Windows.
> 
> BTW if you're looking for a way to try stuff out on Windows exactly
> the way cfbot does it without posting a new patch to the mailing list,
> I put some instructions here:
> 
> https://wiki.postgresql.org/wiki/Continuous_Integration
> 
> The .patch there could certainly be improved (ideally, I think, it'd
> run the whole build farm animal script) but it's a start.

Cool!  Is there any reason that your patch for Travis and AppVeyor
integration is not just committed to master?

Is there a way to get my CF entry building in cfbot, or will it get
there when it gets there?

I know I can just apply your patch, push to my fork, and have Travis and
AppVeyor build it.  And I just might, but cfbot is a neato service!

Nico
-- 



Re: [PATCH v16] GSSAPI encryption support

2018-06-07 Thread Nico Williams
On Fri, Jun 08, 2018 at 10:11:52AM +1200, Thomas Munro wrote:
> On Fri, Jun 8, 2018 at 9:00 AM, Nico Williams  wrote:
> > Cool!  Is there any reason that your patch for Travis and AppVeyor
> > integration is not just committed to master?
> 
> I think that's a good idea and I know that some others are in favour.
> The AppVeyor one is not good enough to propose just yet: it's
> cargo-culted and skips a test that doesn't pass and only runs the
> basic check tests (not contribs, isolation, TAP etc).  Fortunately
> Andrew Dunstan has recently figured out how to run the official build
> farm script inside AppVeyor[1], and it looks like we might be close to
> figuring out that one test that doesn't work.  That makes me wonder if
> we should get the Travis version to use the build farm scripts too.
> If we can get all of that sorted out, then yes, I would like to
> propose that we stick the .XXX.yml files in the tree.  Another thing
> not yet explored is Travis's macOS build support.

I use AppVeyor for Heimdal and for jq...  Maybe I can help out.

As for Travis' OS X support... the problem there is that their build
farm is very small, so using theirs means waiting and waiting.

> Someone might argue that we shouldn't depend on particular external
> services, or that there are other CI services out there and we should
> use those too/instead for some reason, or that we don't want all that
> junk at top level in the tree.  But it seems to me that as long as
> they're dot prefixed files, we could carry control files for any
> number of CI services without upsetting anyone.  Having them in the
> tree would allow anyone who has a publicly accessible git repo
> (bitbucket, GitHub, ...) to go to any CI service that interests them
> and enable it with a couple of clicks.

Carrying the .yml files causes no harm beyond dependence, but that's a
nice problem to have when the alternative is to not have a CI at all.

> Then cfbot would still need to create new branches automatically (that
> is fundamentally what it does: converts patches on the mailing list
> into branches on GitHub), but it wouldn't need to add those control
> files anymore, just the submitted patches.

You wouldn't need it to.  Instead the CF page could let submitters link
their CI status pages/buttons...

> > Is there a way to get my CF entry building in cfbot, or will it get
> > there when it gets there?
> 
> Urgh, due to a bug in my new rate limiting logic it stopped pushing
> new branches for a day or two.  Fixed, and I see it's just picked up
> your submission #1319.  Usually it picks things up within minutes (it
> rescans threads whenever the 'last mail' date changes on the
> Commitfest web page), and then also rechecks each submission every
> couple of days.

Thanks!

> In a nice demonstration of the complexity of these systems, I see that
> the build for your submission on Travis failed because apt couldn't
> update its package index because repo.mongodb.org's key has expired.
> Other recent builds are OK so that seems to be a weird transient
> failure; possibly out of data in some cache, or some fraction of their
> repo server farm hasn't been updated yet or ... whatever.  Bleugh.

Oof.

> > I know I can just apply your patch, push to my fork, and have Travis and
> > AppVeyor build it.  And I just might, but cfbot is a neato service!
> 
> Thanks.  The next step is to show cfbot's results in the Commitfest
> app, and Magnus and I have started working on that.  I gave a talk
> about all this at PGCon last week, and the slides are up[2] in case
> anyone is interested:

OK.  I think that will be a huge improvement.  I find CF to be fantastic
as it is, but this will make it even better.

Thanks,

Nico
-- 



Re: [PATCH v16] GSSAPI encryption support

2018-06-11 Thread Nico Williams
On Mon, Jun 11, 2018 at 09:27:17AM -0400, Robert Haas wrote:
> On Thu, Jun 7, 2018 at 6:11 PM, Thomas Munro
>  wrote:
> >> Cool!  Is there any reason that your patch for Travis and AppVeyor
> >> integration is not just committed to master?
> >
> > I think that's a good idea and I know that some others are in favour.
> 
> One problem is that was discussed at PGCon it commits us to one
> particular build configuration i.e. one set of --with-whatever options
> to configure.  It's not bad to provide a reasonable set of defaults,
> but it means that patches which are best tested with some other set of
> values will have to modify the file (I guess?). Patches that need to
> be tested with multiple sets of flags are ... maybe just out of luck?

Hmm, that's not really true.  You can have a build and test matrix with
more than one row in it.

For example, look at: https://travis-ci.org/heimdal/heimdal

You'll see that Heimdal's Travis-CI integration has four builds:

 - Linux w/ GCC
 - Linux w/ Clang
 - OS X w/ Clang
 - Linux code coverage w/ GCC

We could easily add more options for Heimdal, if we felt we needed to
build and test more with Travis-CI.

Appveyor also has matrix support (though I'm not using it in Heimdal's
Appveyor-CI integration).

Now, of course if we had a very large set of configurations to test,
things might get slow, and the CIs might find it abusive.  It would be
best to use a maximal build configuration and go from there.  So, for
example, two configurations, one with and w/o JIT, but with all the
optional libraries (GSS, LDAP, ICU, Perl, Python, ...), and with two
different compilers (GCC and Clang, with Clang only for the JIT), plus
one OS X build (with JIT), and so on:

 - Linux w/ GCC
 - Linux w/ Clang (   JIT)
 - Linux w/ Clang (no JIT)
 - Linux code coverage
 - OS X  w/ Clang (   JIT)

and similarly for Windows on Appveyor.

> I really don't understand the notion of putting the build script
> inside the source tree.  It's all fine if there's One Way To Do It but
> often TMTOWTDII.  If the build configurations are described outside
> the source tree then you can have as many of them as you need.

Well, all the free CIs like Travis and Appveyor do it this way.  You
don't have to *use* it just because the .yml files are in the source
tree.  But you have to have the .yml files in the source tree in order
to use these CIs.  It'd be nice to be able to point somewhere else for
them, but whatever, that's not something we get much choice in at this
time.

The .yml files are unobtrusive, and it's handy to have them in-tree
anyways.  It also makes it easier to do things like:

 - get build passing/failing buttons on wiki / build status pages 

 - make sure that the .yml files stay up to date as the source tree gets
   changed

It also makes it somewhat easier to get hooked on github and such, but
a bit of discipline will make that a non-issue.

Nico
-- 



Re: [PATCH v16] GSSAPI encryption support

2018-06-11 Thread Nico Williams
On Mon, Jun 11, 2018 at 01:31:12PM -0400, Andrew Dunstan wrote:
> On 06/11/2018 01:13 PM, Nico Williams wrote:
> >Well, all the free CIs like Travis and Appveyor do it this way.  You
> >don't have to *use* it just because the .yml files are in the source
> >tree.  But you have to have the .yml files in the source tree in order
> >to use these CIs.  It'd be nice to be able to point somewhere else for
> >them, but whatever, that's not something we get much choice in at this
> >time.
> 
> That's not true, at least for Appveyor (can't speak about travis - I have no
> first hand experience). For appveyor, you can supply a custom appveyor.yml
> file, which can be a complete URL. In fact, if you use a plain git source as
> opposed to one of the managed git services it supports, you have to do it
> that way - it ignores an appveyor.yml in your repo. I found this out the
> very hard way over the last few days, and they very kindly don't warn you at
> all about this.

OK, that's.. nice, maybe, I guess, but I'd still want version control
for these yml files -- why not have them in-tree?  I'd rather have them
in-tree unless there's a good reason not to have them there.

In other projects I definitely find it better to have these files
in-tree.

Nico
-- 



Re: [PATCH v18] GSSAPI encryption support

2018-06-11 Thread Nico Williams
On Mon, Jun 11, 2018 at 04:11:10PM -0400, Robbie Harwood wrote:
> Nico was kind enough to provide me with some code review.  This should
> [address] those concerns (clarify short-read behavior and fixing error checking on
> GSS functions).

Besides the bug you fixed and which I told you about off-list (on IRC,
specifically), I only have some commentary that does not need any
action:

 - support for non-Kerberos/default GSS mechanisms

   This might require new values for gssmode: prefer-<mechanism>
   and require-<mechanism>.  One could always use SPNEGO if there
   are multiple mechanisms to choose from.  And indeed, you could just
   use SPNEGO if the user has credentials for multiple mechanisms.

   (Because GSS has no standard mechanism _names_, this means making
   some up.  This is one obnoxious shortcoming of the GSS-API...)


 - when the SCRAM channel binding work is done, it might be good to add
   an option for TLS + GSS w/ channel binding to TLS and no gss wrap
   tokens


Nico
-- 



Re: ON CONFLICT DO NOTHING on pg_dump

2018-06-12 Thread Nico Williams
On Tue, Jun 12, 2018 at 09:05:23AM +, Ideriha, Takeshi wrote:
> >From: Surafel Temesgen [mailto:surafel3...@gmail.com] 
> >Subject: ON CONFLICT DO NOTHING on pg_dump
> 
> >Sometimes I have to maintain two similar database and I have to update one 
> >from the other and notice having the option to add ON CONFLICT DO NOTHING 
> >clause to >INSERT command in the dump data will allows pg_restore to be done 
> >with free of ignore error.
> 
> Hi,
> I feel like that on-conflict-do-nothing support is useful especially coupled 
> with --data-only option.
> Only the difference of data can be restored.

But that's additive-only.  Only missing rows are restored this way, and
differences are not addressed.

If you want restore to restore data properly and concurrently (as
opposed to renaming a new database into place or whatever) then you'd
want a) MERGE, b) dump to generate MERGE statements.  A concurrent data
restore operation would be rather neat.

Nico
-- 



Re: [PATCH v18] GSSAPI encryption support

2018-06-12 Thread Nico Williams
On Tue, Jun 12, 2018 at 12:36:01PM -0400, Robbie Harwood wrote:
> Nico Williams  writes:
> > On Mon, Jun 11, 2018 at 04:11:10PM -0400, Robbie Harwood wrote:
> >> Nico was kind enough to provide me with some code review.  This should
> >> [address] those concerns (clarify short-read behavior and fixing error checking on
> >> GSS functions).
> >
> > Besides the bug you fixed and which I told you about off-list (on IRC,
> > specifically), I only have some commentary that does not need any
> > action:
> >
> >  - support for non-Kerberos/default GSS mechanisms
> >
> >This might require new values for gssmode: prefer-<mechanism>
> >and require-<mechanism>.  One could always use SPNEGO if there
> >are multiple mechanisms to choose from.  And indeed, you could just
> >use SPNEGO if the user has credentials for multiple mechanisms.
> >
> >(Because GSS has no standard mechanism _names_, this means making
> >some up.  This is one obnoxious shortcoming of the GSS-API...)
> 
> As long as it's better than passing raw OIDs on the CLI...

Rite?

I think this can be a follow-on patch, though trying SPNEGO if the user
has credentials for multiple mechanisms (and SPNEGO is indicated) seems
simple enough to do now (no interface changes).

> >  - when the SCRAM channel binding work is done, it might be good to add
> >an option for TLS + GSS w/ channel binding to TLS and no gss wrap
> >tokens
> 
> I think both of these are neat ideas if they'll be used.  Getting GSSAPI
> encryption in shouldn't preclude either in its present form (should make
> it easier, I hope), but I'm glad to hear of possible future work as
> well!

This one can (must) wait.  It has some security benefits.  You get to
use GSS/Kerberos for authentication, but you also get the forward
secrecy you'd get from TLS (if the GSS mechanism doesn't provide it,
which Kerberos today does not).

Nico
-- 



Re: ON CONFLICT DO NOTHING on pg_dump

2018-06-18 Thread Nico Williams
On Fri, Jun 15, 2018 at 02:20:21AM +, Ideriha, Takeshi wrote:
> >From: Nico Williams [mailto:n...@cryptonector.com]
> >On Tue, Jun 12, 2018 at 09:05:23AM +, Ideriha, Takeshi wrote:
> >> Only the difference of data can be restored.
> >
> >But that's additive-only.  Only missing rows are restored this way, and 
> >differences are
> >not addressed.
> >
> >If you want restore to restore data properly and concurrently (as opposed to 
> >renaming
> >a new database into place or whatever) then you'd want a) MERGE, b) dump to
> >generate MERGE statements.  A concurrent data restore operation would be 
> >rather
> >neat.
> 
> I agree with you though supporting MERGE or ON-CONFLICT-DO-UPDATE seems hard 
> work.
> Only ON-CONCLICT-DO-NOTHING use case may be narrow.

Is it narrow, or is it just easy enough to add quickly?

And by the way, you don't need MERGE.  You can just generate INSERT/
UPDATE/DELETE statements -- MERGE is mainly an optimization on that, and
could wait until PG has a MERGE.
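For instance, with the dumped rows loaded into a staging table, a
restore tool could emit per-table statements along these lines (a
sketch, assuming a PK column "id"):

  -- Add missing rows (the ON CONFLICT DO NOTHING part):
  INSERT INTO t (id, val)
    SELECT id, val FROM t_incoming
    ON CONFLICT (id) DO NOTHING;

  -- Update rows that differ:
  UPDATE t SET val = i.val
    FROM t_incoming i
    WHERE t.id = i.id AND t.val IS DISTINCT FROM i.val;

  -- Remove rows not present in the dump:
  DELETE FROM t
    WHERE NOT EXISTS (SELECT 1 FROM t_incoming i WHERE i.id = t.id);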

Nico
-- 



Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2018-06-18 Thread Nico Williams
On Mon, Jun 11, 2018 at 06:22:22PM +0900, Masahiko Sawada wrote:
> As per discussion at PGCon unconference, I think that firstly we need
> to discuss what threats we want to defend database data against. If

We call that a threat model.  There can be many threat models, of
course.

> user wants to defend against a threat that is malicious user who
> logged in OS or database steals an important data on datbase this
> design TDE would not help. Because such user can steal the data by
> getting a memory dump or by SQL. That is of course differs depending
> on system requirements or security compliance but what threats do you
> want to defend database data against? and why?

This design guards (somewhat) againts the threat of the storage theft
(e.g., because the storage is remote).  It's a fine threat model to
address, but it's also a lot easier to address in the filesystem or
device drivers -- there's no need to do this in PostgreSQL itself except
so as to support it on all platforms regardless of OS capabilities.

Note that unless the pg_catalog is protected against manipulation by
remote storage, then TDE for user tables might be possible to
compromise.  Like so: the attacker manipulates the pg_catalog to
escalate privilege in order to obtain the TDE keys.  This argues for
full database encryption, not just specific tables or columns.  But
again, this is for the threat model where the storage is the threat.

Another similar threat model is dump management, where dumps are sent
off-site where untrusted users might read them, or even edit them in the
hopes that they will be used for restores and thus compromise the
database.  This is most easily addressed by just encrypting the backups
externally to PG.

Threat models where client users are the threat are easily handled by
PG's permissions system.

I think any threat model where DBAs are not the threat is just not that
interesting to address with crypto within postgres itself...

Encryption to public keys for which postgres does not have private keys
would be one way to address DBAs-as-the-threat, but this is easily done
with an extension...  A small amount of syntactic sugar might help:

  CREATE ROLE "bar" WITH (PUBLIC KEY "...");

  CREATE TABLE foo (
name TEXT PRIMARY KEY,
payload TEXT ENCRYPTED TO ROLE "bar" BOUND TO name
  );

but this is just syntactic sugar, so not that valuable.  On the other
hand, just a bit of syntactic sugar can help tick a feature checkbox,
which might be very valuable for marketing reasons even if it's not
valuable for any other reason.

Note that encrypting the payload without a binding to the PK (or similar
name) is very dangerous!  So the encryption option would have to support
some way to indicate what other plaintext to bind in (here the "name"
column).
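Concretely, something like this (a sketch only; in the DBA-as-threat
model the keys live on the client):

  CREATE TABLE foo (
    name    text PRIMARY KEY,
    payload bytea NOT NULL,  -- ciphertext
    tag     bytea NOT NULL   -- MAC over name || payload, e.g.
                             -- hmac(name::bytea || payload, k, 'sha256')
  );

  -- On read the client recomputes the MAC, so a payload cut-n-pasted
  -- from a row with a different name fails to verify.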

Note also that for key management reasons it would be necessary to be
able to write the payload as ciphertext rather than as to-be-encrypted
TEXT.

Lastly, for a symmetric encryption option one would need a remote oracle
to do the encryption, which seems rather complicated, but in some cases
may well perform faster.

Nico
-- 



Re: Query Rewrite for Materialized Views (Postgres Extension)

2018-06-18 Thread Nico Williams
On Mon, Jun 18, 2018 at 07:38:13PM +0100, Dent John wrote:
> I commented to Corey (privately) that, while my rewrite extension has
> gotten me a server that responds quickly to aggregate queries, the
> constant need to refresh the supporting MVs means the system’s load
> average is constant and much higher than before. I’m happy with the
> tradeoff for now, but it’s a huge waste of energy, and I’m sure it
> must thrash my disk.
> 
> I’m very interested in what other people think of Corey’s idea.

I've written an alternative materialization extension (entirely as
PlPgSQL) based on PG's internals, but my version has a few big wins that
might help here.  I'm thinking of properly integrating it with PG.  Some
of the features include:

 - you can write triggers that update the materialization

   This is because the materialization is just a regular table in my
   implementation.

 - you can mark a view as needing a refresh (e.g., in a trigger)

 - you can declare a PK, other constraints, and indexes on a
   materialization

   The DMLs used to refresh a view concurrently can take advantage of
   the PK and/or other indexes to go fast.

 - you get a history table which records updates to the materialization

   This is useful for generating incremental updates to external
   systems.

Keeping track of refresh times should help decide whether to use a
materialization in some query, whether to refresh it first, or whether
not to use it at all.
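
The mark-as-needing-refresh machinery can be as simple as this (a
sketch with invented names, not the extension's actual schema):

  CREATE TABLE mat_state (
    mat_name      text PRIMARY KEY,
    needs_refresh boolean NOT NULL DEFAULT false,
    last_refresh  timestamptz
  );

  CREATE FUNCTION mark_mat_stale() RETURNS trigger
  LANGUAGE plpgsql AS $$
  BEGIN
    UPDATE mat_state SET needs_refresh = true
    WHERE mat_name = TG_ARGV[0];
    RETURN NULL;   -- AFTER trigger: return value is ignored
  END $$;

  -- Each table source of the materialization gets a statement trigger:
  CREATE TRIGGER groups_touch
    AFTER INSERT OR UPDATE OR DELETE ON groups
    FOR EACH STATEMENT
    EXECUTE PROCEDURE mark_mat_stale('user_groups_mat');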

One of the things I'd eventually like to do is analyze the view query
AST to automatically generate triggers to update materializations or
mark them as needing refreshes.  A first, very very rough sketch of such
an analysis looks like this:

 - if the view query has CTEs
   -> create triggers on all its table sources to mark the
  materialization as needing a refresh

 - else if a table appears more than once as a table source in the view
   query
   -> create triggers on that table that mark the materialization as
  needing a refresh

 - else if a table appears anywhere other than the top-level
   -> create triggers .. mark as needing refresh

 - else if a table is a right-side of a left join
   -> create triggers .. mark as needing refresh

 - else if a table has no PK
   -> create triggers .. mark as needing refresh

 - else if the query has no GROUP BY, or only does a GROUP BY on this
   table and a list of columns prefixed by the table's PK
   -> rewrite the query to have WHERE eq conditions on values for the
  table's PK columns
  
  analyze this query

  if the result shows this table source as the first table in the
  plan
  -> create triggers on this table to update the materialization
 directly from querying the source view

 - else
   -> create triggers .. mark as needing refresh


Nico
-- 



Re: Query Rewrite for Materialized Views (Postgres Extension)

2018-06-19 Thread Nico Williams
On Tue, Jun 19, 2018 at 08:46:06AM +0100, Dent John wrote:
> I’m pretty impressed anything in this space can be written entirely in
> PlPGQSL!

https://github.com/twosigma/postgresql-contrib

PG is quite powerful!

I have even implemented a COMMIT TRIGGER in pure PlPgSQL.

You'll notice I make extensive use of record/table types.

> If you did integrate your implementation, it would be easy for my
> Extension to read from a table other than the one which it gets the MV
> definition from... Although having said that, if you went down the
> route you suggest, would not you make that “regular table” into a
> first class scheme object very much like Corey’s CONTINUOUS
> MATERIALIZED VIEW object concept?

I know nothing about the CONTINUOUS MATERIALIZED VIEW concept.  What
that would imply to me seems... difficult to achieve.  There will be
view queries that are difficult or impossible to automatically write
triggers for that update an MV synchronously.

> It is interesting that you can put triggers onto the table though, as
> that leads well in to use cases where it is desirable to “stack” MVs
> upon each other. (I’m not immediately sure whether such a use case is
> still needed in face of an always-up-to-date MV feature such as is
> described, but I’ve seen it elsewhere.)

I have done exactly this sort of MV chaining.

In my use case I had an MV of a nesting group transitive closure and
then another of a join between that and user group memberships to get a
complete user group transitive closure.

The transitive closure of nesting groups being computed via a RECURSIVE
CTE...  In principle one can understand such a query and automatically
write DMLs to update the MV on the fly (I've done this _manually_), but
the moment you do any bulk updates out of sequence you can't, and then
you have to refresh the view, so you see, I don't quite believe we can
have a true continuously materialized view :(
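For reference, the nesting-group closure is the usual recursive-CTE
shape (table and column names here are illustrative):

  CREATE VIEW group_closure AS
    WITH RECURSIVE closure(ancestor, descendant) AS (
      SELECT parent_group, member_group FROM group_members
      UNION
      SELECT c.ancestor, g.member_group
      FROM closure c
      JOIN group_members g ON g.parent_group = c.descendant
    )
    SELECT ancestor, descendant FROM closure;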

For me the key requirement is the ability to generate incremental
updates to an external system, but also the whole thing has to be fast.

> You say you’d like to base it off a VIEW’s AST (rather than, I
> presume, you must parse the reconstructed VIEW source text as SQL?),

PG already stores the AST.  There's no need to write a new parser when
PG already has one.  At the end of the day you need to analyze an AST
for the MV's source query in order to automatically write the triggers
to keep it updated (or mark it as needing a refresh).

> and I do agree — that’s probably the right direction... it does seem
> to me there is scope to leverage the “bottom half” of the ASSERTION
> stuff from Dave Fetter that Corey linked to — i.e., the part of it
> that manages triggers. Still leaves the AST crawling deciding what to
> actually do once a change is caught.

I'll search for this.

> Really good to hear about progress in this area.

Eh, I've not actually implemented any automatic generation of triggers
to update MVs.  I've written enough such triggers manually to believe
that *some* of them could be written by software.  If you look at my
sketch for how to do it, you'll notice that many of the sorts of queries
that one would choose to materialize... are not really amenable to this
treatment -- that's precisely because those make for the sorts of slow
queries that make you reach for materialization in the first place :(

But even so, automatically-generated triggers that mark an MV as needing
a refresh are always possible, and that is a huge improvement anyways,
especially if concurrent view refreshes can be made to go faster (by
having PKs on the MVs).  The idea is to have some sort of adaptive,
automatic, background concurrent MV refresh running on a frequency based
in part on the amount of time it takes to refresh the VIEW.

BTW, MERGE would be a significant optimization for concurrent MV
refresh.  Think of MERGE as a statement that can scan a source, FULL
OUTER JOIN it to a target table, and for each row do an INSERT, UPDATE,
or DELETE -- this is 3x faster than the three INSERT/UPDATE/DELETE
statements you need to do the same work without a MERGE!
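Concretely, the three-statement version of a refresh looks like this
(sketch: "mat" is the materialization table, "src" the view being
materialized, "pk" the key column):

  INSERT INTO mat
    SELECT * FROM src s
    WHERE NOT EXISTS (SELECT 1 FROM mat m WHERE m.pk = s.pk);

  UPDATE mat m SET val = s.val
    FROM src s
    WHERE m.pk = s.pk AND m.val IS DISTINCT FROM s.val;

  DELETE FROM mat m
    WHERE NOT EXISTS (SELECT 1 FROM src s WHERE s.pk = m.pk);

  -- Each statement scans src again; MERGE would do it in one pass.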

Nico
-- 



Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2018-06-20 Thread Nico Williams
On Wed, Jun 20, 2018 at 05:16:46PM -0400, Bruce Momjian wrote:
> On Mon, Jun 18, 2018 at 12:29:57PM -0500, Nico Williams wrote:
> > Note that unless the pg_catalog is protected against manipulation by
> > remote storage, then TDE for user tables might be possible to
> > compromise.  Like so: the attacker manipulates the pg_catalog to
> > escalate privelege in order to obtain the TDE keys.  This argues for
> > full database encryption, not just specific tables or columns.  But
> > again, this is for the threat model where the storage is the threat.
> 
> Yes, one big problem with per-column encryption is that administrators
> can silently delete data, though they can't add or modify it.

They can also re-add ("replay") deleted values; this can only be
defeated by also binding TX IDs or alike in the ciphertext.  And if you
don't bind the encrypted values to the PKs then they can add any value
they've seen to different rows.

One can protect to some degree against replay and reuse attacks, but
protecting against silent deletion is much harder.  Protecting against
the rows (or the entire DB) being restored at a past point in time is
even harder -- you quickly end up wanting Merkle hash/MAC trees and key
rotation, but this complicates everything and is performance killing.

> > I think any threat model where DBAs are not the threat is just not that
> > interesting to address with crypto within postgres itself...
> 
> Yes, but in my analysis the only solution there is client-side
> encryption:

For which threat model?

For threat models where the DBAs are not the threat there's no need for
client-side encryption: just encrypt the storage at the postgres
instance (with encrypting device drivers or -preferably- filesystems).

For threat models where the DBAs are the threat then yes, client-side
encryption works (or server-side encryption to public keys), but you
must still bind the encrypted values to the primary keys, and you must
provide integrity protection for as much data as possible -- see above.

Client-side crypto is hard to do well and still get decent performance.
So on the whole I think that crypto is a poor fit for the DBAs-are-the-
threat threat model.  It's better to reduce the number of DBAs/sysadmins
and audit all privileged (and, for good measure, unprivileged) access.

Client-side encryption, of course, wouldn't be a feature of PG..., as PG
is mostly a very smart server + very dumb clients.  The client could be
a lot smarter, for sure -- it could be a full-fledged RDBMS, it could
even be a postgres instance accessing the real server via FDW.

For example, libgda, the GNOME Data Access library, IIRC, is a smart client
that uses SQLite3 to access remote resources via virtual table
extensions that function a lot like PG's FDW.  This works well because
SQLite3 is embeddable and light-weight.  PG wouldn't fit that bill as
well, but one could start a PG instance to proxy a remote one via FDW,
with crypto done in the proxy.

>   http://momjian.us/main/writings/crypto_hw_use.pdf#page=97
> 
> You might want to look at the earlier slides too.

I will, thanks.

Nico
-- 



Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2018-06-20 Thread Nico Williams
On Wed, Jun 20, 2018 at 06:06:56PM -0400, Joe Conway wrote:
> On 06/20/2018 05:09 PM, Bruce Momjian wrote:
> > On Mon, Jun 18, 2018 at 09:49:20AM -0400, Robert Haas wrote:
> >> know the ordering of the values under whatever ordering semantics
> >> apply to that index.  It's unclear to me how useful such information
> > 
> > I don't think an ordered index is possible, only indexing of encrypted
> > hashes, i.e. see this and the next slide:
> 
> It is possible with homomorphic encryption -- whether we want to support
> that in core is another matter.

It's also possible using DNSSEC NSEC3-style designs.

Nico
-- 



Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2018-06-20 Thread Nico Williams
On Wed, Jun 20, 2018 at 06:19:40PM -0400, Joe Conway wrote:
> On 06/20/2018 05:12 PM, Bruce Momjian wrote:
> > On Mon, Jun 18, 2018 at 11:06:20AM -0400, Joe Conway wrote:
> > Even if they are encrypted with the same key, they use different
> > initialization vectors that are stored inside the encrypted payload, so
> > you really can't identify much except the length, as Robert stated.

Definitely use different IVs, and don't reuse them (or use cipher modes
where IV reuse is not fatal).

> The more you encrypt with a single key, the more fuel you give to the
> person trying to solve for the key with cryptanalysis.

With modern 128-bit block ciphers in modern cipher modes you'd have to
encrypt enough data to make this not a problem.  On the other hand,
you'll still have other reasons to do key rotation.  Key rotation
ultimately means re-encrypting everything.  Getting all of this right is
very difficult.

So again, what's the threat model?  Because if it's sysadmins/DBAs
you're afraid of, there are better things to do.

Nico
-- 



Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2018-06-20 Thread Nico Williams
On Thu, Jun 21, 2018 at 10:05:41AM +0900, Masahiko Sawada wrote:
> On Thu, Jun 21, 2018 at 6:57 AM, Nico Williams  wrote:
> > On Wed, Jun 20, 2018 at 05:16:46PM -0400, Bruce Momjian wrote:
> >> On Mon, Jun 18, 2018 at 12:29:57PM -0500, Nico Williams wrote:
> >> > Note that unless the pg_catalog is protected against manipulation by
> >> > remote storage, then TDE for user tables might be possible to
> >> > compromise.  Like so: the attacker manipulates the pg_catalog to
> >> > escalate privelege in order to obtain the TDE keys.  This argues for
> >> > full database encryption, not just specific tables or columns.  But
> >> > again, this is for the threat model where the storage is the threat.
> >>
> >> Yes, one big problem with per-column encryption is that administrators
> >> can silently delete data, though they can't add or modify it.
> >
> > They can also re-add ("replay") deleted values; this can only be
> > defeated by also binding TX IDs or alike in the ciphertext.  And if you
> > don't bind the encrypted values to the PKs then they can add any value
> > they've seen to different rows.
> 
> I think we could avoid it by implementations. If we implement
> per-column encryption by putting all encrypted columns out to another
> table like TOAST table and encrypting whole that external table then
> we can do per-column encryption without such concerns. Also, that way
> we can encrypt data when disk I/O even if we use per-column
> encryption. It would get a better performance. A downside of this idea
> is extra overhead to access encrypted column but it would be
> predictable since we have TOAST.

The case we were discussing was one where the threat model is that the
DBAs are the threat.  It is only in that case that the replay,
cut-n-paste, and silent deletion attacks are relevant.  Encrypting a
table, or the whole DB, on the server side, does nothing to protect
against that threat.

Never lose track of the threat model.

Nico
-- 



Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2018-06-21 Thread Nico Williams
On Thu, Jun 21, 2018 at 10:14:54AM -0400, Bruce Momjian wrote:
> On Wed, Jun 20, 2018 at 04:57:18PM -0500, Nico Williams wrote:
> > Client-side crypto is hard to do well and still get decent performance.
> > So on the whole I think that crypto is a poor fit for the DBAs-are-the-
> > threat threat model.  It's better to reduce the number of DBAs/sysadmins
> > and audit all privileged (and, for good measure, unprivileged) access.
> 
> Yeah, kind of.  There is the value of preventing accidental viewing of
> the data by the DBA, and of course WAL and backup encryption are nice.

One generally does not use crypto to prevent "accidental" viewing of
plaintext, but to provide real security relative to specific threats.

If you stop at encrypting values with no integrity protection for the
PKs, and no binding to TX IDs and such, you will indeed protect against
accidental viewing of the plaintext, but not against a determined
malicious insider.

Is that worthwhile?  Remember: you'll have to reduce and audit sysadmin
& DBA access anyways.

There is also the risk that users won't understand the limitations of
this sort of encryption feature and might get a false sense of security
from [mis]using it.

I'd want documentation to make it absolutely clear that such a feature
is only meant to reduce the risk of accidental viewing of plaintext by
DBAs and not a real security feature.

Nico
-- 



Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2018-06-21 Thread Nico Williams
On Fri, May 25, 2018 at 08:41:46PM +0900, Moon, Insung wrote:
> Issues on data encryption of PostgreSQL
> ==
> Currently, in PostgreSQL, data encryption can be using pgcrypto Tool.
> However, it is inconvenient to use pgcrypto to encrypts data in some cases.
> 
> There are two significant inconveniences.
> 
> First, if we use pgcrypto to encrypt/decrypt data, we must call pgcrypto 
> functions everywhere we encrypt/decrypt.

Not so.  VIEWs with INSTEAD OF triggers allow you to avoid this.

> Second, we must modify application program code much if we want to do
> database migration to PostgreSQL from other databases that is using
> TDE.

Not so.  See above.

However, I have at times been told that I should use SQL Server or
whatever because it has column encryption.  My answer is always that it
doesn't help (see my other posts on this thread), but from a business
perspective I understand the problem: the competition has a shiny (if
useless) feature XYZ, therefore we must have it also.

I'm not opposed to PG providing encryption features similar to the
competition's provided the documentation makes their false-sense-of-
security dangers clear.

Incidentally, PG w/ pgcrypto and FDW does provide everything one needs
to be able to implement client-side crypto:

 - use PG w/ FDW as a client-side proxy for the real DB
 - use pgcrypto in VIEWs with INSTEAD OF triggers in the proxy
 - access the DB via the proxy

Presto: transparent client-side crypto that protects against DBAs.  See
other posts about properly binding ciphertext and plaintext.
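
The proxy-side objects can be as simple as this (a sketch, assuming
pgcrypto and a per-session key passed in via a custom GUC; all names
invented, and foo_enc would really be a foreign table reached via FDW):

  CREATE VIEW foo AS
    SELECT name,
           pgp_sym_decrypt(ctext, current_setting('app.key')) AS payload
    FROM foo_enc;

  CREATE FUNCTION foo_insert() RETURNS trigger
  LANGUAGE plpgsql AS $$
  BEGIN
    INSERT INTO foo_enc (name, ctext)
    VALUES (NEW.name,
            pgp_sym_encrypt(NEW.payload, current_setting('app.key')));
    RETURN NEW;
  END $$;

  CREATE TRIGGER foo_insert INSTEAD OF INSERT ON foo
    FOR EACH ROW EXECUTE PROCEDURE foo_insert();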

Protection against malicious DBAs is ultimately a very difficult thing
to get right -- if you really have DBAs as a threat and take that threat
seriously then you'll end up implementing a Merkle tree and performance
will go out the window.

> In these discussions, there were requirements necessary to support TDE in 
> PostgreSQL.
> 
> 1) The performance overhead of encryption and decryption database data must 
> be minimized
> 2) Need to support WAL encryption.
> 3) Need to support Key Management Service.

(2) and full database encryption could be done by the filesystem /
device drivers.  I think this is a much better answer than including
encryption in the DB just because it means not adding all that
complexity to PG, though it's not as portable as doing it in the DB (and
this may well be a winning argument).

What (3) looks like depends utterly on the threat model.  We must
discuss threat models first.

The threat models will drive the design, and (1) will drive some
trade-offs.

> Therefore, I'd like to propose the new design of TDE that deals with
> both above requirements.  Since this feature will become very large,
> I'd like to hear opinions from community before starting making the
> patch.

Any discussion of cryptographic applications should start with a
discussion of threat models.  This is not a big hurdle.

Nico
-- 



Re: libpq compression

2018-06-21 Thread Nico Williams
On Thu, Jun 21, 2018 at 10:12:17AM +0300, Konstantin Knizhnik wrote:
> On 20.06.2018 23:34, Robbie Harwood wrote:
> >Konstantin Knizhnik  writes:
> >Well, that's a design decision you've made.  You could put lengths on
> >chunks that are sent out - then you'd know exactly how much is needed.
> >(For instance, 4 bytes of network-order length followed by a complete
> >payload.)  Then you'd absolutely know whether you have enough to
> >decompress or not.
> 
> Do you really suggest to send extra header for each chunk of data?
> Please notice that chunk can be as small as one message: dozen of bytes
> because libpq is used for client-server communication with request-reply
> pattern.

You must have lengths, yes, otherwise you're saying that the chosen
compression mechanism must itself provide framing.

I'm not that familiar with compression APIs and formats, but looking at
RFC1950 (zlib) for example I see no framing.

So I think you just have to have lengths.

Now, this being about compression, I understand that you might not want
to have 4-byte lengths, especially given that most messages will be
under 8KB.  So use a varint encoding for the lengths.

Nico
-- 



Threat models for DB cryptography (Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS))

2018-06-21 Thread Nico Williams
On Thu, Jun 21, 2018 at 07:46:35PM -0400, Bruce Momjian wrote:
> Agreed.  I can see from this discussion that we have a long way to go
> before we can produce something clearly useful, but it will be worth it.

Let's start with a set of threat models then.  I'll go first:

1) storage devices as the threat
   a) theft of storage devices
   b) malicious storage device operators

2) malicious backup operators as the threat

3) malicious servers as the threat
   a) compromised servers
   b) insider threat -- rogue admins

4) malicious clients as the threat
   a) compromised clients
   b) insider threat

5) passive adversaries on the network as the threat

6) active adversaries on the network as the threat

7) adversaries on the same host as the server or client


Am I missing any?


For example, modern version control systems that use a Merkle hash tree
have malicious servers as part of their threat model.  Git clients, for
example, can detect non-fast-forward history changes upstream.

For another example, DNSSEC also provides protection against malicious
servers by authenticating not the servers but the _data_.  DNSSEC is a
useful case in point because it's effectively a key/value database that
stores somewhat relational data...


Clearly PG currently covers threat models 4 through 7:

 - passive adversaries on the network (addressed via TLS)
 - active adversaries on the network (addressed via TLS)
 - local adversaries (addressed by standard OS user process isolation)
 - malicious clients (addressed via authentication and authorization)

(1) and (2) can be covered externally:

 - protection against malicious storage or backup operators is trivial
   to provide: just use encrypting filesystems or device drivers, and
   encrypt backups using standard technologies.

One shortcoming of relying on OS functionality for protection against
malicious storage is that not all OSes may provide such functionality.
This could be an argument for implementing full, transparent encryption
for an entire DB in the postgres server.  Not a very compelling
argument, but that's just my opinion -- reasonable people could differ
on this.


PG also authenticates servers, but does nothing to authenticate the data
or functions of the server.  So while PG protects against illegitimate
server impersonators as well as TLS/GSS/SCRAM/... will afford, it does
not protect against rogue server admins nor against compromised servers.


That leaves (3) as the only threat model not covered.  It's also the
most challenging threat model to deal with.

Now, if you're going to protect against malicious servers (insiders)...

 - you can't let the server see any sensitive plaintext (must encrypt it)
 - which includes private/secret keys (the server can't have them, only
   the clients can)
 - you have to not only encrypt but also provide integrity protection,
   for ciphertext as well as for any unencrypted plaintext
 - decryption and integrity protection validation can only be done on
   the client (because only they have the necessary secrets!)

There are a lot of choices to make here that will greatly affect any
analysis of the security of the result.

A full analysis will inexorably lead to one conclusion: it's better to
just not have malicious servers (insiders), because if you really have
to defend against them then the only usable models for applying
cryptography to the problem are a) a Git-like VCS and b) DNSSEC, both
of which are rather heavy-duty for a general-purpose RDBMS.

So I think for (3) the best answer is to just not have that problem:
just reduce and audit admin access.

Still, if anyone wants to cover (3), I argue that PG gives you
everything you need right now: FDW and pgcrypto.  Just build a
solution where you have a PG server proxy that acts as a smart
client to untrusted servers:

              +---------------+   +----------------+
 +--------+   |               |   |                |
 |        |   | postgres      |   | postgres       |
 |        |   | (proxy)       |   | (real server)  |
 | Client |-->|               |-->|                |
 |        |   |               |   |                |
 |        |   | (keys here)   |   | (no keys here) |
 +--------+   |               |   |                |
              +---------------+   +----------------+

In the proxy use FDW (to talk to the real server) and VIEWs with INSTEAD
OF triggers to do all crypto transparently to the client.

Presto.  Transparent crypto right in your queries and DMLs.
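
Concretely, a minimal sketch of the proxy side might look like the
following.  All the names (the server, the table, the app.key setting)
are hypothetical, and pgcrypto's pgp_sym_* functions are just one
possible choice of encryption interface:

  CREATE EXTENSION IF NOT EXISTS pgcrypto;
  CREATE EXTENSION IF NOT EXISTS postgres_fdw;

  -- The untrusted real server, and the ciphertext table as stored there.
  CREATE SERVER untrusted FOREIGN DATA WRAPPER postgres_fdw
      OPTIONS (host 'real-server.example', dbname 'appdb');
  CREATE USER MAPPING FOR CURRENT_USER SERVER untrusted
      OPTIONS (user 'app', password 'app-password');
  CREATE FOREIGN TABLE secrets_remote (
      id    integer NOT NULL,
      ctext bytea
  ) SERVER untrusted OPTIONS (table_name 'secrets');

  -- Plaintext view; the key lives only on the proxy, e.g. via
  -- SET app.key = '...' in the proxy session.
  CREATE VIEW secrets AS
      SELECT id,
             pgp_sym_decrypt(ctext, current_setting('app.key')) AS secret
        FROM secrets_remote;

  -- INSTEAD OF trigger so INSERTs into the view encrypt transparently.
  -- (UPDATE and DELETE need analogous triggers, omitted here.)
  CREATE FUNCTION secrets_insert() RETURNS trigger AS $$
  BEGIN
      INSERT INTO secrets_remote (id, ctext)
      VALUES (NEW.id,
              pgp_sym_encrypt(NEW.secret, current_setting('app.key')));
      RETURN NEW;
  END;
  $$ LANGUAGE plpgsql;

  CREATE TRIGGER secrets_insert INSTEAD OF INSERT ON secrets
      FOR EACH ROW EXECUTE PROCEDURE secrets_insert();

The client then just SELECTs from and INSERTs into secrets; the real
server only ever sees ctext.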

But, you do have to get a number of choices right as to the crypto, and
chances are you won't provide integrity protection for the entire DB
(see above).

Nico
-- 



Re: libpq compression

2018-06-22 Thread Nico Williams
On Fri, Jun 22, 2018 at 10:18:12AM +0300, Konstantin Knizhnik wrote:
> On 22.06.2018 00:34, Nico Williams wrote:
> >So I think you just have to have lengths.
> >
> >Now, this being about compression, I understand that you might not want
> >to have 4-byte lengths, especially given that most messages will be
> >under 8KB.  So use a varint encoding for the lengths.
>
> No explicit framing and lengths are needed when using streaming
> compression.  There may certainly be some kind of framing inside the
> compression protocol itself, but that is intrinsic to the compression
> algorithm.

I don't think that's generally true.  It may be true of the compression
algorithm you're working with.  This is fine, of course, but plugging in
other compression algorithms will require the authors to add framing.



Re: Threat models for DB cryptography (Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS))

2018-06-22 Thread Nico Williams
On Fri, Jun 22, 2018 at 05:31:44AM +, Tsunakawa, Takayuki wrote:
> From: Nico Williams [mailto:n...@cryptonector.com]
> > Let's start with a set of threat models then.  I'll go first:
> 
> Thank you so much for summarizing the current situation.  I'd
> appreciate it if you could write this on the PostgreSQL wiki, when the
> discussion has settled somehow.

Sure, that's a good idea.

> >  - local adversaries (addressed by standard OS user process isolation)
> 
> Does this also mean that we don't have to worry about the following?
> 
> * unencrypted data in the server process memory and core files
> * passwords in .pgpass and recovery.conf (someone familiar with PCI
>   DSS audit said this is a problem)
> * user data in server logs

Short of using things like Intel SGX or homomorphic encryption, I don't
think we can do anything about plaintext in memory -- at some point it
has to be there, therefore it is as vulnerable as the host OS makes it.

Users can always run just one postgres instance and nothing else (and
no other non-admin users) on the host, reducing the local attack
surface to zero.

So, yes, I think this flavor of local vulnerability should be out of
scope for PG.

> > One shortcoming of relying on OS functionality for protection against
> > malicious storage is that not all OSes may provide such functionality.
> > This could be an argument for implementing full, transparent encryption
> > for an entire DB in the postgres server.  Not a very compelling
> > argument, but that's just my opinion -- reasonable people could differ
> > on this.
> 
> Yes, this is one reason I developed TDE in our product.  And
> in-database encryption allows optimization by encrypting only user
> data.

I understand this motivation.  I wouldn't reject this out of hand, even
though I'm not exactly interested either.

Can you keep the impact on the codebase isolated and limited, and the
performance impact when disabled to zero?

> > So I think for (3) the best answer is to just not have that problem:
> > just reduce and audit admin access.
> > 
> > Still, if anyone wants to cover (3), I argue that PG gives you
> > everything you need right now: FDW and pgcrypto.  Just build a
> > solution where you have a PG server proxy that acts as a smart
> > client to untrusted servers:
> 
> Does sepgsql help?

Any functionality in PG that allows DBAs to manage storage, sessions,
..., without having to see table data will help.  It doesn't have to
be tied to trusted OS functionality.

I've not looked at SEP [0] so I don't know if it helps.  I would prefer
that PG simply have native functionality to allow this sort of
separation -- as I'm not a DBA, I don't really know if PG has this.

[0] https://wiki.postgresql.org/wiki/SEPostgreSQL_SELinux_Overview

> Should a malfunctioning or buggy application be considered a threat?
> That's what the sql_firewall extension addresses.

I suppose so, yes.

Nico
-- 


