RE: Parallel Inserts in CREATE TABLE AS
From: Bharath Rupireddy
Sent: Thursday, May 27, 2021 2:59 PM
> On Thu, May 27, 2021 at 12:19 PM houzj.f...@fujitsu.com wrote:
> > BTW, I checked my test results, I was testing INSERT INTO unlogged table.
>
> What do you mean by "testing INSERT INTO"? Is it that you are testing the
> timings for parallel inserts in the INSERT INTO ... SELECT command? If so,
> why should we test parallel inserts in the INSERT INTO ... SELECT command
> here?

Oops, sorry, it's a typo; I actually meant CREATE TABLE AS SELECT.

Best regards,
houzj
RE: Parallel Inserts in CREATE TABLE AS
From: Dilip Kumar
> Basically you are creating a new table and loading data into it, and that
> means you will be less likely to access those data soon, so for such a
> thing spoiling the buffer cache may not be a good idea.

Some people, including me, would say that the table will be accessed soon,
and that's why the data is loaded quickly during minimal maintenance hours.

> I was just suggesting it only as an experiment for identifying the root
> cause.

I thought this is a good chance to possibly change things for the better
(^^). I guess the user would simply think like this: "I just want to finish
CTAS as quickly as possible, so I configured it to take advantage of
parallelism. I want CTAS to make the most use of our resources. Why does
Postgres try to limit resource usage (by using the ring buffer) against my
will?"

Regards
Takayuki Tsunakawa
Re: Race condition in recovery?
On Thu, May 27, 2021 at 12:09 PM Kyotaro Horiguchi wrote:
>
> At Thu, 27 May 2021 11:44:47 +0530, Dilip Kumar wrote in
> > Maybe we can somehow achieve that without a broken archive command,
> > but I am not sure how it is enough to just delete WAL from pg_wal? I
> > mean my original case was that
> > 1. Got the new history file from the archive but did not get the WAL
> > file yet which contains the checkpoint after the TL switch
> > 2. So standby2 tried to stream from the new primary using the old TL
> > and set the wrong TL in expectedTLEs
> >
> > But if you are not doing anything to stop archiving WAL files or to
> > guarantee that the WAL has come to the archive and you deleted those,
> > then I am not sure how we are reproducing the original problem.
>
> Thanks for the reply!
>
> We're writing at the very beginning of the switching segment at
> promotion time. So it is guaranteed that the first segment of the
> newer timeline won't be archived until the remaining almost 16MB of the
> segment is consumed or someone explicitly causes a segment switch
> (including archive_timeout).

I agree.

> > BTW, I have also tested your script and I found the below log, which
> > shows that standby2 is successfully able to select timeline 2, so it
> > is not reproducing the issue. Am I missing something?
>
> standby_2? My last one, 026_timeline_issue_2.pl, doesn't use that name
> and uses "standby_1" and "cascade". In the earlier ones, standby_4 and
> 5 (or 3 and 4 in the later versions) are used in the additional tests.
>
> So I think it should be something different?

Yeah, I had tested your earlier patch, which had a different test case.
With "026_timeline_issue_2.pl", I am able to reproduce the issue.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: Parallel Inserts in CREATE TABLE AS
On Thu, May 27, 2021 at 12:46 PM tsunakawa.ta...@fujitsu.com wrote:
> I thought this is a good chance to possibly change things for the better
> (^^). I guess the user would simply think like this: "I just want to
> finish CTAS as quickly as possible, so I configured to take advantage of
> parallelism. I want CTAS to make the most use of our resources. Why does
> Postgres try to limit resource usage (by using the ring buffer) against
> my will?"

If the idea is to give the user control of whether or not to use the
separate RING BUFFER for bulk inserts/writes, then how about giving it as
a rel option? Currently BAS_BULKWRITE (GetBulkInsertState) is being used
by CTAS, Refresh Mat View, Table Rewrites (ATRewriteTable) and COPY.
Furthermore, we could make the rel option an integer and allow users to
provide the size of the ring buffer they want to choose for a particular
bulk insert operation (of course with a max limit which is not exceeding
the shared buffers or some reasonable amount not exceeding the RAM of the
system).

I think we can discuss this in a separate thread and see what other
hackers think.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
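A rough sketch of what such a rel option might look like; the option name
`bulkwrite_ring_size` is invented here purely for illustration, and no such
reloption exists in PostgreSQL:

```sql
-- Hypothetical syntax sketch: a per-table knob for the bulk-write ring
-- buffer.  The name "bulkwrite_ring_size" is made up for this example.
CREATE TABLE big_target (id bigint, payload text)
    WITH (bulkwrite_ring_size = '256MB');

-- 0 could mean "no ring buffer: use shared buffers freely"
ALTER TABLE big_target SET (bulkwrite_ring_size = 0);
```

The open design question would be the validation rule: the patch discussion
suggests capping the value at something well below shared_buffers.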
RE: Parallel Inserts in CREATE TABLE AS
From: houzj.f...@fujitsu.com
> Although, the 4 workers case still has performance degradation compared
> to the serial case.
>
> SERIAL:                           58759.213 ms
> PARALLEL 2 WORKER [NOT SKIP FSM]: 68390.221 ms  [SKIP FSM]: 58633.924 ms
> PARALLEL 4 WORKER [NOT SKIP FSM]: 67448.142 ms  [SKIP FSM]: 66960.305 ms

Can you see any difference in table sizes?

Regards
Takayuki Tsunakawa
Re: Race condition in recovery?
At Thu, 27 May 2021 12:47:30 +0530, Dilip Kumar wrote in
> On Thu, May 27, 2021 at 12:09 PM Kyotaro Horiguchi wrote:
> >
> > We're writing at the very beginning of the switching segment at
> > promotion time. So it is guaranteed that the first segment of the
> > newer timeline won't be archived until the remaining almost 16MB of
> > the segment is consumed or someone explicitly causes a segment switch
> > (including archive_timeout).
>
> I agree
>
> > > BTW, I have also tested your script and I found the below log, which
> > > shows that standby2 is successfully able to select timeline 2, so it
> > > is not reproducing the issue. Am I missing something?
> >
> > standby_2? My last one, 026_timeline_issue_2.pl, doesn't use that name
> > and uses "standby_1" and "cascade". In the earlier ones, standby_4 and
> > 5 (or 3 and 4 in the later versions) are used in the additional tests.
> >
> > So I think it should be something different?
>
> Yeah, I tested your patch where you had a different test case; with
> "026_timeline_issue_2.pl", I am able to reproduce the issue.

That said, I don't object if we decide to choose the crafted archive
command, as long as we have considered the trade-offs between the two
ways.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
Re: Parallel Inserts in CREATE TABLE AS
On Thu, May 27, 2021 at 1:03 PM tsunakawa.ta...@fujitsu.com wrote:
> > From: houzj.f...@fujitsu.com
> > Although, the 4 workers case still has performance degradation compared
> > to the serial case.
> >
> > SERIAL:                           58759.213 ms
> > PARALLEL 2 WORKER [NOT SKIP FSM]: 68390.221 ms  [SKIP FSM]: 58633.924 ms
> > PARALLEL 4 WORKER [NOT SKIP FSM]: 67448.142 ms  [SKIP FSM]: 66960.305 ms
>
> Can you see any difference in table sizes?

Also, the number of pages the table occupies in each case, along with the
table size, would give more insights.

I do as follows to get the number of pages a relation occupies:

CREATE EXTENSION pgstattuple;
SELECT pg_relpages('test');

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
RE: Parallel Inserts in CREATE TABLE AS
From: Bharath Rupireddy
> I think we can discuss this in a separate thread and see what other
> hackers think.

OK, as long as we don't get stuck in the current direction. (Our goal is
not merely to avoid degrading performance, but to outperform serial
execution, isn't it?)

> If the idea is to give the user control of whether or not to use the
> separate RING BUFFER for bulk inserts/writes, then how about giving it
> as a rel option? Currently BAS_BULKWRITE (GetBulkInsertState) is
> being used by CTAS, Refresh Mat View, Table Rewrites (ATRewriteTable)
> and COPY. Furthermore, we could make the rel option an integer and
> allow users to provide the size of the ring buffer they want to choose
> for a particular bulk insert operation (of course with a max limit
> which is not exceeding the shared buffers or some reasonable amount
> not exceeding the RAM of the system).

I think it's not a table property but an execution property. So, it'd be
appropriate to control it with the SET command, just like the DBA sets
work_mem and maintenance_work_mem for specific maintenance operations.
I'll stop on this here...

Regards
Takayuki Tsunakawa
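Under the "execution property" framing, usage would resemble the existing
per-session maintenance GUCs; the parameter name `bulk_write_ring_size`
below is invented for illustration and does not exist in PostgreSQL:

```sql
-- Hypothetical sketch: a session-level GUC controlling the bulk-write
-- ring buffer, by analogy with work_mem / maintenance_work_mem.
SET maintenance_work_mem = '1GB';    -- existing precedent for session tuning
SET bulk_write_ring_size = '256MB';  -- invented parameter, not real
CREATE TABLE new_copy AS SELECT * FROM big_source;
RESET bulk_write_ring_size;
```

The contrast with the reloption idea is scope: a GUC applies to whichever
bulk operation the session runs next, rather than being tied to one table.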
Re: Skipping logical replication transactions on subscriber side
On Thu, May 27, 2021 at 2:48 PM Amit Kapila wrote: > > On Thu, May 27, 2021 at 9:56 AM Masahiko Sawada wrote: > > > > On Wed, May 26, 2021 at 3:43 PM Amit Kapila wrote: > > > > > > On Tue, May 25, 2021 at 12:26 PM Masahiko Sawada > > > wrote: > > > > > > > > On Mon, May 24, 2021 at 7:51 PM Amit Kapila > > > > wrote: > > > > > > > > > > On Mon, May 24, 2021 at 1:32 PM Masahiko Sawada > > > > > wrote: > > > > > > > > > > I think you need to consider few more things here: > > > > > (a) Say the error occurs after applying some part of changes, then > > > > > just skipping the remaining part won't be sufficient, we probably need > > > > > to someway rollback the applied changes (by rolling back the > > > > > transaction or in some other way). > > > > > > > > After more thought, it might be better to that setting and resetting > > > > the XID to skip requires disabling the subscription. > > > > > > > > > > It might be better if it doesn't require disabling the subscription > > > because it would be more steps for the user to disable/enable it. It > > > is not clear to me what exactly you want to gain by disabling the > > > subscription in this case. > > > > The situation I’m considered is where the user specifies the XID while > > the worker is applying the changes of the transaction with that XID. > > In this case, I think we need to somehow rollback the changes applied > > so far. Perhaps we can either rollback the transaction and ignore the > > remaining changes or restart and ignore the entire transaction from > > the beginning. > > > > If we follow your suggestion of only allowing XIDs that have been > known to have conflicts then probably we don't need to worry about > rollbacks. > > > > > > > > > > > > For (2), what I'm thinking is to add a new action to ALTER > > > > > > SUBSCRIPTION command like ALTER SUBSCRIPTION test_sub SET SKIP > > > > > > TRANSACTION 590. Also, we can have actions to reset it; ALTER > > > > > > SUBSCRIPTION test_sub RESET SKIP TRANSACTION. 
Those commands add the > > > > > > XID to a new column of pg_subscription or a new catalog, having the > > > > > > worker reread its subscription information. Once the worker skipped > > > > > > the specified transaction, it resets the transaction to skip on the > > > > > > catalog. > > > > > > > > > > > > > > > > What if we fail while updating the reset information in the catalog? > > > > > Will it be the responsibility of the user to reset such a transaction > > > > > or we will retry it after restart of worker? Now, say, we give such a > > > > > responsibility to the user and the user forgets to reset it then there > > > > > is a possibility that after wraparound we will again skip the > > > > > transaction which is not intended. And, if we want to retry it after > > > > > restart of worker, how will the worker remember the previous failure? > > > > > > > > As described above, setting and resetting XID to skip is implemented > > > > as a normal system catalog change, so it's crash-safe and persisted. I > > > > think that the worker can either removes the XID or mark it as done > > > > once it skipped the specified transaction so that it won't skip the > > > > same XID again after wraparound. > > > > > > > > > > It all depends on when exactly you want to update the catalog > > > information. Say after skipping commit of the XID, we do update the > > > corresponding LSN to be communicated as already processed to the > > > subscriber and then get the error while updating the catalog > > > information then next time we might not know whether to update the > > > catalog for skipped XID. > > > > > > > Also, it might be better if we reset > > > > the XID also when a subscription field such as subconninfo is changed > > > > because it could imply the worker will connect to another publisher > > > > having a different XID space. 
> > > > We also need to handle the cases where the user specifies an old
> > > > XID, or an XID whose transaction is already prepared on the
> > > > subscriber. I think the worker can reset the XID with a warning
> > > > when it finds out that the XID seems no longer valid or it cannot
> > > > skip the specified XID. For example, in the former case, it can do
> > > > that when the first received transaction's XID is newer than the
> > > > specified XID.
> > >
> > > But how can we guarantee that an older XID can't be received later?
> > > Is there a guarantee that we receive the transactions on the
> > > subscriber in XID order?
> >
> > Considering the above two comments, it might be better to provide a
> > way to skip the transaction that is already known to be conflicted
> > rather than allowing users to specify an arbitrary XID.
>
> Okay, that makes sense, but I am still not sure how you will identify
> whether we need to reset the XID in case of a failure doing that in the
> previous attempt.

It's just an idea, but we can record the failed transaction with its XID
as well as its commit LSN. The sequence I'm thinking is, 1. the
Fix RADIUS error reporting in hba file parsing
The RADIUS-related checks in parse_hba_line() did not respect elevel and did not fill in *err_msg. Also, verify_option_list_length() pasted together error messages in an untranslatable way. To fix the latter, remove the function and do the error checking inline. It's a bit more verbose but only minimally longer, and it makes fixing the first two issues straightforward. From 063a8ae81d5ac33e8700bb0ea076d1499d36b655 Mon Sep 17 00:00:00 2001 From: Peter Eisentraut Date: Thu, 27 May 2021 10:26:21 +0200 Subject: [PATCH] Fix RADIUS error reporting in hba file parsing The RADIUS-related checks in parse_hba_line() did not respect elevel and did not fill in *err_msg. Also, verify_option_list_length() pasted together error messages in an untranslatable way. To fix the latter, remove the function and do the error checking inline. It's a bit more verbose but only minimally longer, and it makes fixing the first two issues straightforward. --- src/backend/libpq/hba.c | 90 ++--- 1 file changed, 48 insertions(+), 42 deletions(-) diff --git a/src/backend/libpq/hba.c b/src/backend/libpq/hba.c index 60767f2957..3be8778d21 100644 --- a/src/backend/libpq/hba.c +++ b/src/backend/libpq/hba.c @@ -144,8 +144,6 @@ static List *tokenize_inc_file(List *tokens, const char *outer_filename, const char *inc_filename, int elevel, char **err_msg); static bool parse_hba_auth_opt(char *name, char *val, HbaLine *hbaline, int elevel, char **err_msg); -static bool verify_option_list_length(List *options, const char *optionname, - List *comparelist, const char *comparename, int line_num); static ArrayType *gethba_options(HbaLine *hba); static void fill_hba_line(Tuplestorestate *tuple_store, TupleDesc tupdesc, int lineno, HbaLine *hba, const char *err_msg); @@ -1607,21 +1605,23 @@ parse_hba_line(TokenizedLine *tok_line, int elevel) if (list_length(parsedline->radiusservers) < 1) { - ereport(LOG, + ereport(elevel, (errcode(ERRCODE_CONFIG_FILE_ERROR), errmsg("list of RADIUS servers cannot be empty"), 
errcontext("line %d of configuration file \"%s\"", line_num, HbaFileName))); + *err_msg = "list of RADIUS servers cannot be empty"; return NULL; } if (list_length(parsedline->radiussecrets) < 1) { - ereport(LOG, + ereport(elevel, (errcode(ERRCODE_CONFIG_FILE_ERROR), errmsg("list of RADIUS secrets cannot be empty"), errcontext("line %d of configuration file \"%s\"", line_num, HbaFileName))); + *err_msg = "list of RADIUS secrets cannot be empty"; return NULL; } @@ -1630,24 +1630,53 @@ parse_hba_line(TokenizedLine *tok_line, int elevel) * but that's already checked above), 1 (use the same value * everywhere) or the same as the number of servers. */ - if (!verify_option_list_length(parsedline->radiussecrets, - "RADIUS secrets", - parsedline->radiusservers, - "RADIUS servers", - line_num)) + if (!(list_length(parsedline->radiussecrets) == 1 || + list_length(parsedline->radiussecrets) == list_length(parsedline->radiusservers))) + { + ereport(elevel, + (errcode(ERRCODE_CONFIG_FILE_ERROR), +errmsg("the number of RADIUS secrets (%d) must be 1 or the same as the number of RADIUS servers (%d)", + list_length(parsedline->radiussecrets), + list_length(parsedline->radiusservers)), +errcontext("line %d of configuration file \"%s\"", + line_num, HbaFileName))); + *err_msg = psprintf("the numbe
Re: Skip partition tuple routing with constant partition key
Hou-san, On Thu, May 27, 2021 at 3:56 PM houzj.f...@fujitsu.com wrote: > From: Amit Langote > Sent: Thursday, May 27, 2021 1:54 PM > > On Thu, May 27, 2021 at 11:47 AM houzj.f...@fujitsu.com > > wrote: > > > About teaching relcache about caching the target partition. > > > > > > David-san suggested cache the partidx in PartitionDesc. > > > And it will need looping and checking the cached value at each level. > > > I was thinking can we cache a partidx list[1, 2 ,3], and then we can > > > follow the list to get the last partition and do the partition CHECK > > > only for the last partition. If any unexpected thing happen, we can > > > return to the original table and redo the tuple routing without using the > > cached index. > > > What do you think ? > > > > Where are you thinking to cache the partidx list? Inside PartitionDesc or > > some > > executor struct? > > I was thinking cache the partidx list in PartitionDescData which is in > relcache, if possible, we can > use the cached partition between statements. Ah, okay. I thought you were talking about a different idea. How and where would you determine that a cached partidx value is indeed the correct one for a given row? Anyway, do you want to try writing a patch to see how it might work? -- Amit Langote EDB: http://www.enterprisedb.com
Re: Parallel Inserts in CREATE TABLE AS
On Thu, May 27, 2021 at 10:27 AM Dilip Kumar wrote: > > On Thu, May 27, 2021 at 10:16 AM Bharath Rupireddy > wrote: > > > > On Thu, May 27, 2021 at 7:12 AM houzj.f...@fujitsu.com > > wrote: > > > I am afraid that the using the FSM seems not get a stable performance > > > gain(at least on my machine), > > > I will take a deep look into this to figure out the difference. A naive > > > idea it that the benefit that bulk extension > > > bring is not much greater than the cost in FSM. > > > Do you have some ideas on it ? > > > > I think, if we try what Amit and I said in [1], we should get some > > insights on whether the bulk relation extension is taking more time or > > the FSM lookup. I plan to share the testing patch adding the timings > > and the counters so that you can also test from your end. I hope > > that's fine with you. > > I think some other cause of contention on relation extension locks are > 1. CTAS is using a buffer strategy and due to that, it might need to > evict out the buffer frequently for getting the new block in. Maybe > we can identify by turning off the buffer strategy for CTAS and > increasing the shared buffer so that data fits in memory. > One more thing to ensure is whether all the workers are using the same access strategy? -- With Regards, Amit Kapila.
Re: storing an explicit nonce
Greetings,

On Thu, May 27, 2021 at 4:52 PM Bruce Momjian wrote:
> > I am confused why checksums, which are widely used, acceptably require
> > wal_log_hints, but there is concern that file encryption, which is
> > heavier, cannot acceptably require wal_log_hints. I must be missing
> > something.
> >
> > Why can't checksums also throw away hint bit changes like you want to
> > do for file encryption and not require wal_log_hints?
>
> I'm really confused about it, too.

I read the above communication; I'm not sure if my understanding is
correct... What we are facing is not only changes of flags such as
*pd_flags*, but also others, like the pointer array changes in btree that
Robert mentioned. We don't need them to write a WAL record.

I have an immature idea: could we use LSN+blkno+checksum as the nonce when
checksums are enabled? And when checksums are disabled, we could just use
a global counter to generate a number as a fake checksum value... Then we
would likewise use LSN+blkno+fake_checksum as the nonce. Is there anything
wrong with that?

--
There is no royal road to learning.
HighGo Software Co.
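To make the idea concrete, here is a small sketch of packing the LSN, block
number, and checksum (or a counter standing in for the checksum when data
checksums are disabled) into a fixed-width nonce. This is purely
illustrative pseudocode for the proposal above, not PostgreSQL's actual TDE
design; every name in it is invented.

```python
# Illustrative sketch of the LSN+blkno+checksum nonce idea (not real
# PostgreSQL code).  The nonce is 16 bytes: 8-byte LSN, 4-byte block
# number, 2-byte checksum, 2 bytes of zero padding.
import struct

def page_nonce(lsn: int, blkno: int, checksum: int) -> bytes:
    """Pack LSN, block number and checksum into a 16-byte nonce."""
    return struct.pack(">QIH", lsn, blkno, checksum) + b"\x00\x00"

_counter = 0

def fake_checksum() -> int:
    """Stand-in 'global counter' for clusters with checksums disabled."""
    global _counter
    _counter = (_counter + 1) & 0xFFFF
    return _counter
```

The point of including blkno is that two pages can carry the same LSN after
a multi-page WAL record, and the block number disambiguates them.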
Re: Parallel Inserts in CREATE TABLE AS
On Thu, May 27, 2021 at 2:26 PM Amit Kapila wrote: > > I think some other cause of contention on relation extension locks are > > 1. CTAS is using a buffer strategy and due to that, it might need to > > evict out the buffer frequently for getting the new block in. Maybe > > we can identify by turning off the buffer strategy for CTAS and > > increasing the shared buffer so that data fits in memory. > > > > One more thing to ensure is whether all the workers are using the same > access strategy? In the Parallel Inserts in CTAS patches, the leader and each worker uses its own ring buffer of 16MB i.e. does myState->bistate = GetBulkInsertState(); separately. With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
Re: Skipping logical replication transactions on subscriber side
On Thu, May 27, 2021 at 1:46 PM Masahiko Sawada wrote:
>
> On Thu, May 27, 2021 at 2:48 PM Amit Kapila wrote:
> >
> > Okay, that makes sense but still not sure how will you identify if we
> > need to reset XID in case of failure doing that in the previous
> > attempt.
>
> It's just an idea, but we can record the failed transaction with its
> XID as well as its commit LSN. The sequence I'm thinking is,
>
> 1. the worker records the XID and commit LSN of the failed transaction
> to a catalog.

When will you record this info? I am not sure if we can try to update
this when an error has occurred. We can think of using try..catch in the
apply worker and then record it in the catch block on error, but would
that be advisable? One random thought that occurred to me is that the
apply worker could notify such information to the launcher (or maybe
another process), which would log this information.

> 2. the user specifies how to resolve that conflict transaction
> (currently only 'skip' is supported) and writes to the catalog.
> 3. the worker does the resolution method according to the catalog. If
> the worker didn't start to apply those changes, it can skip the entire
> transaction. If it did, it rolls back the transaction and ignores the
> remaining changes.
>
> The worker needs neither to reset information of the last failed
> transaction nor to mark the conflicted transaction as resolved. The
> worker will ignore that information when checking the catalog if the
> commit LSN is passed.

So won't this require us to check the required info in the catalog
before applying each transaction? If so, that might be overhead; maybe
we can build some cache of the highest commit LSN that can be consulted
rather than the catalog table. I think we also need to think about when
to remove rows for which the conflict has been resolved, as we can't let
that information grow infinitely.

> > Also, I am thinking that instead of a stat view, do we need
> > to consider having a system table (pg_replication_conflicts or
> > something like that) for this, because what if the stats information
> > is lost (say either due to a crash or due to UDP packet loss)? Can we
> > rely on a stats view for this?
>
> Yeah, it seems better to use a catalog.

Okay.

--
With Regards,
Amit Kapila.
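For reference, the user-facing commands proposed earlier in this thread
would look roughly like the following; this is the syntax under discussion,
not an implemented feature:

```sql
-- Sketch of the skip-transaction syntax proposed upthread.
ALTER SUBSCRIPTION test_sub SET SKIP TRANSACTION 590;  -- skip XID 590
ALTER SUBSCRIPTION test_sub RESET SKIP TRANSACTION;    -- clear it again
```

The open questions above are about what happens between these two commands:
where the XID is recorded, and who resets it once the transaction has been
skipped.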
Re: Skipping logical replication transactions on subscriber side
On Thu, May 27, 2021 at 12:01 PM Masahiko Sawada wrote: > > On Wed, May 26, 2021 at 6:11 PM Amit Kapila wrote: > > > > I agree with you that specifying XID could be easier and > > understandable for users. I was thinking and studying a bit about what > > other systems do in this regard. Why don't we try to provide conflict > > resolution methods for users? The idea could be that either the > > conflicts can be resolved automatically or manually. In the case of > > manual resolution, users can use the existing methods or the XID stuff > > you are proposing here and in case of automatic resolution, the > > in-built or corresponding user-defined functions will be invoked for > > conflict resolution. There are more details to figure out in the > > automatic resolution scheme but I see a lot of value in doing the > > same. > > Yeah, I also see a lot of value in automatic conflict resolution. But > maybe we can have both ways? For example, in case where the user wants > to resolve conflicts in different ways or a conflict that cannot be > resolved by automatic resolution (not sure there is in practice > though), the manual resolution would also have value. > Right, that is exactly what I was saying. So, even if both can be done as separate patches, we should try to design the manual resolution in a way that can be extended for an automatic resolution system. I think we can try to have some initial idea/design/POC for an automatic resolution as well to ensure that the manual resolution scheme can be further extended. -- With Regards, Amit Kapila.
Re: logical replication empty transactions
On Tue, May 25, 2021 at 6:36 PM Ajin Cherian wrote:
>
> On Tue, Apr 27, 2021 at 1:49 PM Ajin Cherian wrote:
>
> Rebased the patch as it was no longer applying.

Thanks for the updated patch. A few comments:

1) I'm not sure if we could add some tests for skipping empty
transactions; if possible, add a few tests.

2) We could add some debug-level log messages for the transactions that
will be skipped.

3) You could keep this variable below the other bool variables in the
structure:

+ bool sent_begin_txn; /* flag indicating whether begin
+  * has already been sent */
+

4) You can split this comment across multiple lines as it exceeds 80
characters:

+ /* output BEGIN if we haven't yet, avoid for streaming and
non-transactional messages */
+ if (!data->sent_begin_txn && !in_streaming && transactional)
+ pgoutput_begin(ctx, txn);

Regards,
Vignesh
Re: sync request forward function ForwardSyncRequest() might hang for some time in a corner case?
On Tue, May 25, 2021 at 4:39 PM Paul Guo wrote:
>
> Hi hackers,
>
> I found this when reading the related code. Here is the scenario:
>
> bool
> RegisterSyncRequest(const FileTag *ftag, SyncRequestType type,
>                     bool retryOnError)
>
> For the case where retryOnError is true, the function loops calling
> ForwardSyncRequest() until it succeeds. But if we run into the below
> branch in ForwardSyncRequest(), RegisterSyncRequest() will need to loop
> until the checkpointer absorbs the existing requests, so it might hang
> for some time until a checkpoint request is triggered. This scenario
> seems to be possible in theory, though the chance is not high.

It seems like a really unlikely scenario, but maybe possible if you use a
lot of unlogged tables (as you could eventually dirty/evict more than
NBuffers buffers without triggering enough WAL activity to trigger a
checkpoint with any sane checkpoint configuration).

> ForwardSyncRequest():
>
> if (CheckpointerShmem->checkpointer_pid == 0 ||
>     (CheckpointerShmem->num_requests >= CheckpointerShmem->max_requests &&
>      !CompactCheckpointerRequestQueue()))
> {
>     /*
>      * Count the subset of writes where backends have to do their own
>      * fsync
>      */
>     if (!AmBackgroundWriterProcess())
>         CheckpointerShmem->num_backend_fsync++;
>     LWLockRelease(CheckpointerCommLock);
>     return false;
> }
>
> One fix is to add code like the below in RegisterSyncRequest(), to wake
> the checkpointer in this scenario:
>
> // checkpointer_triggered: variable for one trigger only.
> if (!ret && retryOnError && ProcGlobal->checkpointerLatch &&
>     !checkpointer_triggered)
>     SetLatch(ProcGlobal->checkpointerLatch);
>
> Any comments?

It looks like you intended to set the checkpointer_triggered variable but
didn't. Also, this will wake up the checkpointer but won't force a
checkpoint (unlike RequestCheckpoint()). That may be a good thing though,
as it would only absorb the requests and go back to sleep if no other
threshold is reached.
Apart from the implementation details it seems like it could help in this unlikely event.
RE: Parallel Inserts in CREATE TABLE AS
From: Bharath Rupireddy
Sent: Thursday, May 27, 2021 3:41 PM
> Also, the number of pages the table occupies in each case, along with
> the table size, would give more insights.
>
> I do as follows to get the number of pages a relation occupies:
> CREATE EXTENSION pgstattuple;
> SELECT pg_relpages('test');

It seems the difference between SKIP FSM and NOT SKIP FSM is not big. I
tried several times and the average result is almost the same.

pg_relpages      - 1428575
pg_relation_size - 11702976512 (11GB)

Best regards,
houzj
security_definer_search_path GUC
Hi, Since writing SECURITY DEFINER functions securely requires annoying incantations[1], wouldn't it be nice if we provided a way for the superuser to override the default search path via a GUC in postgresql.conf? That way you can set search_path if you want to override the default, but if you leave it out you're not vulnerable, assuming security_definer_search_path only contains secure schemas. .m
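For context, the incantation in question is the per-function search_path
pin that the CREATE FUNCTION documentation recommends today. A minimal
example of the current pattern (the `pwds` table is illustrative):

```sql
-- The mitigation required today: pin search_path on every SECURITY
-- DEFINER function so that objects resolve only from trusted schemas.
CREATE FUNCTION check_password(uname text, pass text) RETURNS boolean
LANGUAGE sql SECURITY DEFINER
-- without this clause, the *caller's* search_path applies inside the body
SET search_path = pg_catalog, pg_temp
AS $$
    SELECT EXISTS (SELECT 1 FROM pwds WHERE username = $1 AND pwd = $2);
$$;
```

A cluster-wide `security_definer_search_path` GUC would make the SET clause
optional rather than something every function author must remember.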
Re: Add PortalDrop in exec_execute_message
Alvaro Herrera wrote on 2021-05-26 23:59:
> On 2021-May-25, Yura Sokolov wrote:
> > Tom Lane wrote on 2021-05-21 21:23:
> > > Yura Sokolov writes:
> > > > I propose to add PortalDrop at the 'if (completed)' branch of
> > > > exec_execute_message.
> > >
> > > This violates our wire protocol specification, which specifically
> > > says
> > >
> > >     If successfully created, a named portal object lasts till the
> > >     end of the current transaction, unless explicitly destroyed. An
> > >     unnamed portal is destroyed at the end of the transaction, or as
> > >     soon as the next Bind statement specifying the unnamed portal as
> > >     destination is issued. (Note that a simple Query message also
> > >     destroys the unnamed portal.)
> > >
> > > I'm inclined to think that your complaint would be better handled
> > > by having the client send a portal-close command, if it's not
> > > going to do something else immediately.
> >
> > I thought about it as well. Then, if I understand correctly,
> > PQsendQueryGuts and PQsendQueryInternal in pipeline mode should send a
> > "close portal" (CP) message after the "execute" message, right?
>
> I don't think they should do that. The portal remains open, and the
> libpq interface does that. The portal gets closed at end of transaction
> without the need for any client message. I think if the client wanted
> to close the portal ahead of time, it would need a new libpq entry
> point to send the message to do that.

- PQsendQuery issues a Query message, and exec_simple_query closes its
  portal.
- People don't expect PQsendQueryParams to be different from PQsendQuery
  aside from parameter sending. The fact that the portal remains open is
  a significant, unexpected and undesired difference.
- PQsendQueryGuts is used in PQsendQueryParams and PQsendQueryPrepared.
  It always sends an empty portal name and always a "send me all rows"
  limit (zero). Both usages are certainly meant to just perform a query,
  and certainly no one expects the portal to remain open.
> (I didn't add a Close Portal message to PQsendQueryInternal in pipeline
> mode precisely because there is no such message in PQsendQueryGuts.

But PQsendQueryInternal should replicate the behavior of PQsendQuery,
not of PQsendQueryParams. Although it has to use the new protocol, it
should be indistinguishable to the user, and therefore the portal
should be closed.

> I think it would be wrong to unconditionally add a Close Portal message
> to any of those places.)

Why? If you foresee problems, please share your mind.

regards,
Sokolov Yura aka funny_falcon
Re: Add PortalDrop in exec_execute_message
Tom Lane wrote 2021-05-27 00:19:
> Alvaro Herrera writes:
>> (I didn't add a Close Portal message to PQsendQueryInternal in pipeline
>> mode precisely because there is no such message in PQsendQueryGuts.
>> I think it would be wrong to unconditionally add a Close Portal message
>> to any of those places.)
>
> Yeah, I'm not very comfortable with having libpq take it on itself to
> do that, either.

But...

Tom Lane wrote 2021-05-21 21:23:
> I'm inclined to think that your complaint would be better handled
> by having the client send a portal-close command, if it's not
> going to do something else immediately.

And given that PQsendQueryParams should not be different from
PQsendQuery (aside from parameter sending), why shouldn't it close its
portal immediately, as happens in exec_simple_query? I really doubt a
user of PQsendQueryPrepared is aware of the portal either, since it is
also unnamed and also exhausted (because PQsendQueryGuts always sends a
"send me all rows" limit). And why should PQsendQueryInternal behave
differently in pipelined and non-pipelined modes? It closes the portal
in non-pipelined mode, but will not close the portal of the last query
in pipelined mode (inside a transaction).

> Looking back at the original complaint, it seems like it'd be fair to
> wonder why we're still holding a page pin in a supposedly completed
> executor run. Maybe the right fix is somewhere in the executor scan
> logic.

Perhaps because the query is simple and the portal is created as
seekable?

> regards, tom lane

regards,
Yura Sokolov
Re: Move pg_attribute.attcompression to earlier in struct for reduced size?
On Thu, May 27, 2021 at 12:11 AM Tom Lane wrote: > AFAIR, there are zero promises about how effective, or when effective, > changes in SET STORAGE will be. And the number of complaints about > that has also been zero. So I'm not sure why we need to do more for > SET COMPRESSION. Especially since I'm unconvinced that recompressing > everything just to recompress everything would *ever* be worthwhile. I think it is good to have *some* way of ensuring that what you want the system to do, it is actually doing. If we have not a single operation in the system anywhere that can force recompression, someone who actually cares will be left with no option but a dump and reload. That is probably both a whole lot slower than something in the server itself and also a pretty silly thing to have to tell people to do. If it helps, I'd be perfectly fine with having this be an *optional* behavior for CLUSTER or VACUUM FULL, depending on some switch. Or we can devise another way for the user to make it happen. But we shouldn't just be setting a policy that users are not allowed to care whether their data is actually compressed using the compression method they specified. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Bracket, brace, parenthesis
On Thu, May 27, 2021 at 03:20:10PM +0900, Kyotaro Horiguchi wrote: > At Fri, 14 May 2021 10:04:57 -0400, Tom Lane wrote in >> +1. I tend to write "square bracket" or "curly brace" when I want to >> be extra clear, but I think the bare terms are widely understood to >> have those meanings. > > Thanks! I think the message is new in 14 so we can fix it right > away. The attached is the version with a commit message added. No objections from me to fix that on HEAD now for clarity, let's wait a bit and see if others have more comments. You have missed an update of multirangetypes.out, by the way. -- Michael signature.asc Description: PGP signature
Re: Move pg_attribute.attcompression to earlier in struct for reduced size?
Robert Haas writes: > On Thu, May 27, 2021 at 12:11 AM Tom Lane wrote: >> AFAIR, there are zero promises about how effective, or when effective, >> changes in SET STORAGE will be. And the number of complaints about >> that has also been zero. So I'm not sure why we need to do more for >> SET COMPRESSION. Especially since I'm unconvinced that recompressing >> everything just to recompress everything would *ever* be worthwhile. > I think it is good to have *some* way of ensuring that what you want > the system to do, it is actually doing. If we have not a single > operation in the system anywhere that can force recompression, someone > who actually cares will be left with no option but a dump and reload. > That is probably both a whole lot slower than something in the server > itself and also a pretty silly thing to have to tell people to do. [ shrug... ] I think the history of the SET STORAGE option teaches us that there is no such requirement, and you're inventing a scenario that doesn't exist in the real world. regards, tom lane
Re: pg_rewind fails if there is a read only file.
On Wed, May 26, 2021 at 10:32 PM Andrew Dunstan wrote:
>
> On 5/26/21 2:43 AM, Laurenz Albe wrote:
> > On Wed, 2021-05-26 at 08:57 +0900, Michael Paquier wrote:
> >> On Tue, May 25, 2021 at 03:17:37PM -0400, Andrew Dunstan wrote:
> >>> If we do decide to do something the question arises what should it do?
> >>> If we're to allow for it I'm wondering if the best thing would be simply
> >>> to ignore such a file.
> >> Enforcing an assumption that any file could be read-only is a very bad
> >> idea, as that could lead to weird behaviors if a FS turns read-only
> >> suddenly while doing a rewind. Another idea that has popped out across
> >> the years was to add an option to pg_rewind so that users could filter
> >> files manually. That could easily be dangerous in the wrong hands,
> >> though, as one could think that it is a good idea to skip a control
> >> file, for example.
> >>
> >> The thing is that here we actually know the set of files we'd like to
> >> ignore most of the time, and we still want to have some automated
> >> control over what gets filtered. So here is a new idea: we build a list
> >> of files based on a set of GUC parameters using postgres -C on the
> >> target data folder, and assume that these are safe enough to be skipped
> >> all the time, if these are in the data folder.
> > That sounds complicated, but should work.
> > There should be a code comment somewhere that warns people not to forget
> > to look at that when they add a new GUC.
> >
> > I can think of two alternatives to handle this:
> >
> > - Skip files that cannot be opened for writing and issue a warning.
> >   That is simple, but coarse.
> >   A slightly more sophisticated version would first check if files
> >   are the same on both machines and skip the warning for those.
> >
> > - Paul's idea to try and change the mode on the read-only file
> >   and reset it to the original state after pg_rewind is done.
> How about we only skip (with a warning) read-only files if they are in
> the data root, not a subdirectory? That would save us from silently
> ignoring read-only filesystems and not involve using a GUC.

That assumption seems to limit the user scenario.

How about this: by default, change and reset the file mode before and
after open() for read-only files, but also allow passing skip-file names
to pg_rewind (e.g. pg_rewind --skip filenameN or --skip-list-file
listfile) in case users really want to skip some files (probably not
just read-only files). Presumably the people who run pg_rewind are DBAs
or admins with basic knowledge of this, so we should not worry too much
about users skipping important files (we could even set a deny list for
this). Also, this solution works fine for a read-only file system, since
pg_rewind quits as soon as it fails to add write permission to any
read-only file.

-- Paul Guo (Vmware)
Re: sync request forward function ForwardSyncRequest() might hang for some time in a corner case?
On Thu, May 27, 2021 at 7:11 PM Julien Rouhaud wrote: > > On Tue, May 25, 2021 at 4:39 PM Paul Guo wrote: > > > > Hi hackers, > > > > I found this when reading the related code. Here is the scenario: > > > > bool > > RegisterSyncRequest(const FileTag *ftag, SyncRequestType type, > > bool retryOnError) > > > > For the case retryOnError is true, the function would in loop call > > ForwardSyncRequest() until it succeeds, but in ForwardSyncRequest(), > > we can see if we run into the below branch, RegisterSyncRequest() will > > need to loop until the checkpointer absorbs the existing requests so > > ForwardSyncRequest() might hang for some time until a checkpoint > > request is triggered. This scenario seems to be possible in theory > > though the chance is not high. > > It seems like a really unlikely scenario, but maybe possible if you > use a lot of unlogged tables maybe (as you could eventually > dirty/evict more than NBuffers buffers without triggering enough WALs > activity to trigger a checkpoint with any sane checkpoint > configuration). RegisterSyncRequest() handles SYNC_UNLINK_REQUEST and SYNC_FORGET_REQUEST scenarios, besides the usual SYNC_REQUEST type for buffer sync. > > ForwardSyncRequest(): > > > > if (CheckpointerShmem->checkpointer_pid == 0 || > > (CheckpointerShmem->num_requests >= CheckpointerShmem->max_requests > > && > > !CompactCheckpointerRequestQueue())) > > { > > /* > > * Count the subset of writes where backends have to do their own > > * fsync > > */ > > if (!AmBackgroundWriterProcess()) > > CheckpointerShmem->num_backend_fsync++; > > LWLockRelease(CheckpointerCommLock); > > return false; > > } > > > > One fix is to add below similar code in RegisterSyncRequest(), trigger > > a checkpoint for the scenario. > > > > // checkpointer_triggered: variable for one trigger only. > > if (!ret && retryOnError && ProcGlobal->checkpointerLatch && > > !checkpointer_triggered) > > SetLatch(ProcGlobal->checkpointerLatch); > > > > Any comments? 
> It looks like you intended to set the checkpointer_triggered var but
> didn't.

Yes, this is just pseudocode.

> Also this will wake up the checkpointer but won't force a
> checkpoint (unlike RequestCheckpoint()). It may be a good thing

I do not expect an immediate checkpoint. AbsorbSyncRequests() is
enough, since after that RegisterSyncRequest() could finish.

> though as it would only absorb the requests and go back to sleep if no
> other threshold is reached. Apart from the implementation details it
> seems like it could help in this unlikely event.

-- Paul Guo (Vmware)
Re: sync request forward function ForwardSyncRequest() might hang for some time in a corner case?
On Thu, May 27, 2021 at 9:59 PM Paul Guo wrote: > > On Thu, May 27, 2021 at 7:11 PM Julien Rouhaud wrote: > > > > On Tue, May 25, 2021 at 4:39 PM Paul Guo wrote: > > > > > > Hi hackers, > > > > > > I found this when reading the related code. Here is the scenario: > > > > > > bool > > > RegisterSyncRequest(const FileTag *ftag, SyncRequestType type, > > > bool retryOnError) > > > > > > For the case retryOnError is true, the function would in loop call > > > ForwardSyncRequest() until it succeeds, but in ForwardSyncRequest(), > > > we can see if we run into the below branch, RegisterSyncRequest() will > > > need to loop until the checkpointer absorbs the existing requests so > > > ForwardSyncRequest() might hang for some time until a checkpoint > > > request is triggered. This scenario seems to be possible in theory > > > though the chance is not high. > > > > It seems like a really unlikely scenario, but maybe possible if you > > use a lot of unlogged tables maybe (as you could eventually > > dirty/evict more than NBuffers buffers without triggering enough WALs > > activity to trigger a checkpoint with any sane checkpoint > > configuration). > > RegisterSyncRequest() handles SYNC_UNLINK_REQUEST and > SYNC_FORGET_REQUEST scenarios, besides the usual SYNC_REQUEST type for > buffer sync. > > > > ForwardSyncRequest(): > > > > > > if (CheckpointerShmem->checkpointer_pid == 0 || > > > (CheckpointerShmem->num_requests >= > > > CheckpointerShmem->max_requests && > > > !CompactCheckpointerRequestQueue())) > > > { > > > /* > > > * Count the subset of writes where backends have to do their own > > > * fsync > > > */ > > > if (!AmBackgroundWriterProcess()) > > > CheckpointerShmem->num_backend_fsync++; > > > LWLockRelease(CheckpointerCommLock); > > > return false; > > > } > > > > > > One fix is to add below similar code in RegisterSyncRequest(), trigger > > > a checkpoint for the scenario. > > > > > > // checkpointer_triggered: variable for one trigger only. 
> > > if (!ret && retryOnError && ProcGlobal->checkpointerLatch &&
> > >     !checkpointer_triggered)
> > >     SetLatch(ProcGlobal->checkpointerLatch);
> > >
> > > Any comments?
> >
> > It looks like you intended to set the checkpointer_triggered var but
> > didn't.
>
> Yes, this is just pseudocode.
>
> > Also this will wake up the checkpointer but won't force a
> > checkpoint (unlike RequestCheckpoint()). It may be a good thing
>
> I do not expect an immediate checkpoint. AbsorbSyncRequests()
> is enough, since after that RegisterSyncRequest() could finish.
>
> > though as it would only absorb the requests and go back to sleep if no
> > other threshold is reached. Apart from the implementation details it
> > seems like it could help in this unlikely event.

Also note that ForwardSyncRequest() does wake up the checkpointer if it
thinks the request queue in shared memory is "too full", but does not
wake it up when the queue is actually full (if the queue is full and
compaction fails, the function returns false before reaching the code
below). This does not seem reasonable. See this code in
ForwardSyncRequest():

    /* If queue is more than half full, nudge the checkpointer to empty it */
    too_full = (CheckpointerShmem->num_requests >=
                CheckpointerShmem->max_requests / 2);

    /* ... but not till after we release the lock */
    if (too_full && ProcGlobal->checkpointerLatch)
        SetLatch(ProcGlobal->checkpointerLatch);

-- Paul Guo (Vmware)
Re: Parallel Inserts in CREATE TABLE AS
On Thu, May 27, 2021 at 9:53 AM Bharath Rupireddy wrote:
> > One idea to find this out could be that we have three counters for
> > each worker which counts the number of times each worker extended the
> > relation in bulk, the number of times each worker extended the
> > relation by one block, the number of times each worker gets the page
> > from FSM. It might be possible that with this we will be able to
> > figure out why there is a difference between your and Hou-San's
> > results.
>
> Yeah, that helps. And also, the time spent in
> LockRelationForExtension, ConditionalLockRelationForExtension,
> GetPageWithFreeSpace and RelationAddExtraBlocks too can give some
> insight.
>
> My plan is to have a patch with above info added in (which I will
> share it here so that others can test and see the results too) and run
> the "case 4" where there's a regression seen on my system.

I captured the below information with the attached patch
0001-test-times-and-block-counts.patch applied on top of the CTAS v23
patch set. Testing details are attached in the file named "test".

Total time spent in LockRelationForExtension
Total time spent in GetPageWithFreeSpace
Total time spent in RelationAddExtraBlocks
Total number of times extended the relation in bulk
Total number of times extended the relation by one block
Total number of blocks added in bulk extension
Total number of times getting the page from FSM

Here is a summary of what I observed:

1) The execution time with 2 workers, without TABLE_INSERT_SKIP_FSM
(140 sec) is more than with 0 workers (112 sec).
2) The execution time with 2 workers, with TABLE_INSERT_SKIP_FSM (225
sec) is more than with 2 workers, without TABLE_INSERT_SKIP_FSM (140
sec).
3) The majority of the time is going into waiting for the relation
extension lock in LockRelationForExtension.
With 2 workers, without TABLE_INSERT_SKIP_FSM, out of the total
execution time of 140 sec, the time spent in LockRelationForExtension
is ~40 sec and the time spent in RelationAddExtraBlocks is ~20 sec. So,
~60 sec are being spent in these two functions.

With 2 workers, with TABLE_INSERT_SKIP_FSM, out of the total execution
time of 225 sec, the time spent in LockRelationForExtension is ~135 sec
and the time spent in RelationAddExtraBlocks is 0 sec (because we skip
the FSM, no bulk-extend logic applies). So, most of the time is being
spent in LockRelationForExtension.

I'm still not sure why the execution time with 0 workers (serial
execution, no parallelism involved) on my testing system is 112 sec,
compared to 58 sec on Hou-San's system for the same use case. Maybe the
testing system I'm using is not of the latest configuration compared to
the others.

Having said that, I request others to try and see whether the same
observations (as above) are made on their testing systems for the same
use case. If others don't see a regression (with just 2 workers), or
observe little difference with and without TABLE_INSERT_SKIP_FSM, I'm
open to changing the parallel inserts in CTAS code to use
TABLE_INSERT_SKIP_FSM.

In any case, if the observation is that a good amount of time is being
spent in LockRelationForExtension, I'm not sure (at this point) whether
we can do something about it or just live with it.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

From 105495d5aa6b25f61c812dd5cb9ff692abe38373 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy
Date: Thu, 27 May 2021 13:37:32 +0530
Subject: [PATCH] test times and block counts

---
 src/backend/access/heap/hio.c             | 21
 src/backend/commands/createas.c           | 41 +++
 src/backend/storage/freespace/freespace.c | 22 +++-
 src/backend/storage/lmgr/lmgr.c           | 13 +++
 src/include/commands/createas.h           | 17 ++
 5 files changed, 113 insertions(+), 1 deletion(-)

diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index d34edb4190..d737c2d763 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -19,6 +19,8 @@
 #include "access/hio.h"
 #include "access/htup_details.h"
 #include "access/visibilitymap.h"
+#include "commands/createas.h"
+#include "portability/instr_time.h"
 #include "storage/bufmgr.h"
 #include "storage/freespace.h"
 #include "storage/lmgr.h"
@@ -198,12 +200,17 @@ RelationAddExtraBlocks(Relation relation, BulkInsertState bistate)
 	firstBlock = InvalidBlockNumber;
 	int			extraBlocks;
 	int			lockWaiters;
+	instr_time	start;
+	instr_time	end;

 	/* Use the length of the lock wait queue to judge how much to extend. */
 	lockWaiters = RelationExtensionLockWaiterCount(relation);
 	if (lockWaiters <= 0)
 		return;

+	if (is_ctas)
+		INSTR_TIME_SET_CURRENT(start);
+
 	/*
 	 * It might seem like multiplying the number of lock waiters by as much as
 	 * 20 is too aggressive, but benchmarking revealed that smaller numbers
@@ -212,6 +219,9 @@ RelationAddExtraBlocks(Relation relation, BulkInsertState bistate)
 	 */
 	extraBlocks = Min(512, lockWaite
Re: Move pg_attribute.attcompression to earlier in struct for reduced size?
On Thu, May 27, 2021 at 7:04 PM Tom Lane wrote: > > Robert Haas writes: > > On Thu, May 27, 2021 at 12:11 AM Tom Lane wrote: > >> AFAIR, there are zero promises about how effective, or when effective, > >> changes in SET STORAGE will be. And the number of complaints about > >> that has also been zero. So I'm not sure why we need to do more for > >> SET COMPRESSION. Especially since I'm unconvinced that recompressing > >> everything just to recompress everything would *ever* be worthwhile. > > > I think it is good to have *some* way of ensuring that what you want > > the system to do, it is actually doing. If we have not a single > > operation in the system anywhere that can force recompression, someone > > who actually cares will be left with no option but a dump and reload. > > That is probably both a whole lot slower than something in the server > > itself and also a pretty silly thing to have to tell people to do. > > [ shrug... ] I think the history of the SET STORAGE option teaches us > that there is no such requirement, and you're inventing a scenario that > doesn't exist in the real world. But can we compare SET STORAGE with SET compression? I mean storage just controls how the data are stored internally and there is no external dependency. But if we see the compression it will have a dependency on the external library. So if the user wants to get rid of the dependency on the external library then IMHO, there should be some way to do it by recompressing all the data. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: sync request forward function ForwardSyncRequest() might hang for some time in a corner case?
On Thu, May 27, 2021 at 9:59 PM Paul Guo wrote: > > > It seems like a really unlikely scenario, but maybe possible if you > > use a lot of unlogged tables maybe (as you could eventually > > dirty/evict more than NBuffers buffers without triggering enough WALs > > activity to trigger a checkpoint with any sane checkpoint > > configuration). > > RegisterSyncRequest() handles SYNC_UNLINK_REQUEST and > SYNC_FORGET_REQUEST scenarios, besides the usual SYNC_REQUEST type for > buffer sync. I know, but the checkpointer can hold up to NBuffers requests, so I highly doubt that you can end up filling the buffer with those. > > > ForwardSyncRequest(): > > > > > > if (CheckpointerShmem->checkpointer_pid == 0 || > > > (CheckpointerShmem->num_requests >= > > > CheckpointerShmem->max_requests && > > > !CompactCheckpointerRequestQueue())) > > > { > > > /* > > > * Count the subset of writes where backends have to do their own > > > * fsync > > > */ > > > if (!AmBackgroundWriterProcess()) > > > CheckpointerShmem->num_backend_fsync++; > > > LWLockRelease(CheckpointerCommLock); > > > return false; > > > } > > > > > > One fix is to add below similar code in RegisterSyncRequest(), trigger > > > a checkpoint for the scenario. > > > > > > // checkpointer_triggered: variable for one trigger only. > > > if (!ret && retryOnError && ProcGlobal->checkpointerLatch && > > > !checkpointer_triggered) > > > SetLatch(ProcGlobal->checkpointerLatch); > > > > > > Any comments? > > > > It looks like you intended to set the checkpointer_triggered var but > > Yes this is just pseduo code. > > > didn't. Also this will wake up the checkpointer but won't force a > > checkpoint (unlike RequestCheckpoint()). It may be a good thing > > I do not expect an immediate checkpoint. AbsorbSyncRequests() > is enough since after that RegisterSyncRequest() could finish. 
You said "trigger a checkpoint", which sounded more like forcing a
checkpoint rather than waking up the checkpointer so that it can absorb
the pending requests, so it seemed worth mentioning what it would
really do.
Re: sync request forward function ForwardSyncRequest() might hang for some time in a corner case?
> You said "trigger a checkpoint", which sounded more like forcing a > checkpointer rather than waking up the checkpointer so that it can > absorb the pending requests, so it seems worth to mention what it > would really do. Yes it is not accurate. Thanks for the clarification. -- Paul Guo (Vmware)
Re: sync request forward function ForwardSyncRequest() might hang for some time in a corner case?
On Thu, May 27, 2021 at 10:05 PM Paul Guo wrote:
> > Also note that ForwardSyncRequest() does wake up the checkpointer if
> it thinks the requests in shared memory are "too full", but does not
> wake up when the request is actually full. This does not seem to be
> reasonable.
> See below code in ForwardSyncRequest
>
> /* If queue is more than half full, nudge the checkpointer to empty it */
> too_full = (CheckpointerShmem->num_requests >=
> CheckpointerShmem->max_requests / 2);
>
> /* ... but not till after we release the lock */
> if (too_full && ProcGlobal->checkpointerLatch)
> SetLatch(ProcGlobal->checkpointerLatch);

Ah indeed. Well, it means that the checkpointer is woken up early enough
to avoid reaching that point. I'm not sure that it's actually possible
to reach a point where the list is full and the checkpointer is sitting
idle.
Re: Move pg_attribute.attcompression to earlier in struct for reduced size?
On Thu, May 27, 2021 at 10:18 AM Dilip Kumar wrote: > > [ shrug... ] I think the history of the SET STORAGE option teaches us > > that there is no such requirement, and you're inventing a scenario that > > doesn't exist in the real world. > > But can we compare SET STORAGE with SET compression? I mean storage > just controls how the data are stored internally and there is no > external dependency. But if we see the compression it will have a > dependency on the external library. So if the user wants to get rid > of the dependency on the external library then IMHO, there should be > some way to do it by recompressing all the data. TBH, I'm more concerned about the other direction. Surely someone who upgrades from an existing release to v14 and sets their compression method to lz4 is going to want a way of actually converting their data to using lz4. To say that nobody cares about that is to deem the feature useless. Maybe that's what Tom thinks, but it's not what I think. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Speed up pg_checksums in cases where checksum already set
Thanks for the quick replies, everyone.

On Wed, May 26, 2021 at 10:17 PM Michael Paquier wrote:
>
> -if (do_sync)
> +if (do_sync && total_files_modified)
> Here, I am on the edge. It could be an advantage to force a flush of
> the data folder anyway, no?

I was originally on the fence about including this as well, but since
the database is shut down and already in a consistent state, there
seems to be no advantage to syncing if we have made no changes. Things
are no better or worse than when we arrived. However, the real-world
use case of running pg_checksums --enable and getting no changed blocks
is probably fairly rare, so if there is a strong objection, I'm happy
reverting to just (do_sync). (I'm not sure how cheap a sync is; I
assume it's low impact as the database is shut down, so I guess it
becomes a "might as well while we are here"?)

On Wed, May 26, 2021 at 10:29 PM Justin Pryzby wrote:
> In one of the checksum patches, there was an understanding that the pages
> should be written even if the checksum is correct, to handle replicas.
> ...
> Does your patch complicate things for the "stop all the clusters before
> switching them all" case?

I cannot imagine how it would, but, like Michael, I'm not really
understanding the reasoning here. We only run when safely shut down, so
no WAL or dirty buffers need concern us :). Of course, once the
postmaster is up and running, fiddling with checksums becomes vastly
more complicated, as evidenced by that thread. I'm happy sticking to
and speeding up the offline version for now.

Cheers,
Greg
Re: Reducing the range of OIDs consumed by genbki.pl
On Wed, May 26, 2021 at 5:43 PM Tom Lane wrote: > So I propose shoehorning this into v14, to avoid shipping a > release where FirstBootstrapObjectId has been bumped up. Just to repeat on this thread what I said on the other one, I am +1 on this as a concept. I have not reviewed the patch. Thanks, -- Robert Haas EDB: http://www.enterprisedb.com
Re: Move pg_attribute.attcompression to earlier in struct for reduced size?
Robert Haas writes: > On Thu, May 27, 2021 at 10:18 AM Dilip Kumar wrote: >>> [ shrug... ] I think the history of the SET STORAGE option teaches us >>> that there is no such requirement, and you're inventing a scenario that >>> doesn't exist in the real world. >> But can we compare SET STORAGE with SET compression? I mean storage >> just controls how the data are stored internally and there is no >> external dependency. But if we see the compression it will have a >> dependency on the external library. So if the user wants to get rid >> of the dependency on the external library then IMHO, there should be >> some way to do it by recompressing all the data. > TBH, I'm more concerned about the other direction. Surely someone who > upgrades from an existing release to v14 and sets their compression > method to lz4 is going to want a way of actually converting their data > to using lz4. To say that nobody cares about that is to deem the > feature useless. Maybe that's what Tom thinks, but it's not what I > think. What I'm hearing is a whole lot of hypothesizing and zero evidence of actual field requirements. On the other side of the coin, we've already wasted significant person-hours on fixing this feature's memory leakage, and now people are proposing to expend more effort on solving^Wpapering over its performance issues by adding yet more user-visible complication. It's already adding too much user-visible complication IMO --- I know because I was just copy-editing the documentation about that yesterday. I say it's time to stop the bleeding and rip it out. When and if there are actual field requests to have a way to do this, we can discuss what's the best way to respond to those requests. Hacking VACUUM probably isn't the best answer, anyway. But right now, we are past feature freeze, and I think we ought to jettison this one rather than quickly kluge something. regards, tom lane
Re: storing an explicit nonce
On Wed, May 26, 2021 at 4:40 PM Bruce Momjian wrote: > You are saying that by using a non-LSN nonce, you can write out the page > with a new nonce, but the same LSN, and also discard the page during > crash recovery and use the WAL copy? I don't know what "discard the page during crash recovery and use the WAL copy" means. > I am confused why checksums, which are widely used, acceptably require > wal_log_hints, but there is concern that file encryption, which is > heavier, cannot acceptably require wal_log_hints. I must be missing > something. I explained this in the first complete paragraph of my first email with this subject line: "For example, right now, we only need to WAL log hints for the first write to each page after a checkpoint, but in this approach, if the same page is written multiple times per checkpoint cycle, we'd need to log hints every time." That's a huge difference. Page eviction in some workloads can push the same pages out of shared buffers every few seconds, whereas something that has to be done once per checkpoint cycle cannot affect each page nearly so often. A checkpoint is only going to occur every 5 minutes by default, or more realistically every 10-15 minutes in a well-tuned production system. In other words, we're not holding up some kind of double standard, where the existing feature is allowed to depend on doing a certain thing but your feature isn't allowed to depend on the same thing. Your design depends on doing something which is potentially 100x+ more expensive than the existing thing. It's not always going to be that expensive, but it can be. > Why can't checksums also throw away hint bit changes like you want to do > for file encryption and not require wal_log_hints? Well, I don't want to throw away hint bit changes, just like we don't throw them away right now. And I want to do that by making sure that each time the page is written, we use a different nonce, but without the expense of having to advance the LSN. 
Now, another option is to do what you suggest here. We could say that if a dirty page is evicted, but the page is only dirty because of hint-type changes, we don't actually write it out. That does avoid using the same nonce for multiple writes, because now there's only one write. It also fixes the problem on standbys that Andres was complaining about, because on a standby, the only way a page can possibly be dirtied without an associated WAL record is through a hint-type change. However, I think we'd find that this, too, is pretty expensive in certain workloads. It's useful to write hint bits - that's why we do it. -- Robert Haas EDB: http://www.enterprisedb.com
Re: storing an explicit nonce
On Wed, May 26, 2021 at 04:26:01PM -0700, Andres Freund wrote: > Hi, > > On 2021-05-26 07:14:47 +0200, Antonin Houska wrote: > > Bruce Momjian wrote: > > > On Tue, May 25, 2021 at 04:48:21PM -0700, Andres Freund wrote: > > > My point was about whether we need to change the nonce, and hence > > > WAL-log full page images if we change hint bits. If we don't and > > > reencrypt the page with the same nonce, don't we only expose the hint > > > bits? I was not suggesting we avoid changing the nonce in non-hint-bit > > > cases. > > > > > > I don't understand your computation above. You decrypt the page into > > > shared buffers, you change a hint bit, and rewrite the page. You are > > > re-XOR'ing the buffer copy with the same key and nonce. Doesn't that > > > only change the hint bits in the new write? > > Yea, I had a bit of a misfire there. Sorry. > > I suspect that if we try to not disclose data if an attacker has write > access, this still leaves us with issues around nonce reuse, unless we > also employ integrity measures. Particularly due to CTR mode, which > makes it easy to manipulate individual parts of the encrypted page > without causing the decrypted page to be invalid. E.g. the attacker can > just update pd_upper on the page by a small offset, and suddenly the > replay will insert the tuple at a slightly shifted offset - which then > seems to leak enough data to actually analyze things? Yes, I don't think protecting from write access is a realistic goal at this point, and frankly ever. I think write access protection needs all-cluster-file encryption. This is documented: https://github.com/postgres/postgres/compare/master..bmomjian:_cfe-01-doc.patch Cluster file encryption does not protect against unauthorized file system writes. Such writes can allow data decryption if used to weaken the system's security and the weakened system is later supplied with the externally-stored cluster encryption key. 
This also does not always detect if users with write access remove or modify database files. If this needs more text, let me know.

> As the patch stands that seems trivially doable, because as I read it, most of the page header is not encrypted, and checksums are done of the already encrypted data. But even if that weren't the case, brute forcing 16bit worth of checksum isn't too bad, even though it would obviously make an attack a lot more noisy.
>
> https://github.com/bmomjian/postgres/commit/7b43d37a5edb91c29ab6b4bb00def05def502c33#diff-0dcb5b2f36c573e2a7787994690b8fe585001591105f78e58ae3accec8f998e0R92
> /*
>  * Check if the page has a special size == GISTPageOpaqueData, a valid
>  * GIST_PAGE_ID, no invalid GiST flag bits are set, and a valid LSN. This
>  * is true for all GiST pages, and perhaps a few pages that are not. The
>  * only downside of guessing wrong is that we might not update the LSN for
>  * some non-permanent relation page changes, and therefore reuse the IV,
>  * which seems acceptable.
>  */
>
> Huh?

Are you asking about this C comment in relation to the discussion above, or is it an independent question? Are you asking what it means?

-- Bruce Momjian https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Re: CREATE COLLATION - check for duplicate options and error out if found one
On Wed, May 26, 2021 at 7:44 PM Bharath Rupireddy wrote: > > On Wed, May 26, 2021 at 7:18 PM vignesh C wrote: > > +1 for fixing this issue, we have handled this error in other places. > > The patch does not apply on head, could you rebase the patch on head > > and post a new patch. > > Thanks. I thought of rebasing once the other patch (which reorganizes > "...specified more than once" error) gets committed. Anyways, I've > rebased for now on the latest master. Please review v2 patch. The test changes look good to me, I liked the idea of rebasing the source changes once "specified more than once" patch gets committed. I will review the code changes after that. Regards, Vignesh
Re: storing an explicit nonce
On Wed, May 26, 2021 at 04:46:29PM -0700, Andres Freund wrote: > Hi, > > On 2021-05-25 22:23:46 -0400, Stephen Frost wrote: > > Andres mentioned other possible cases where the LSN doesn’t change even > > though we change the page and, as he’s probably right, we would have to > > figure out a solution in those cases too (potentially including cases like > > crash recovery or replay on a replica where we can’t really just go around > > creating dummy WAL records to get new LSNs..). > > Yea, I think there's quite a few of those. For one, we don't guarantee > that the hole between pd_lower/upper is zeroes. It e.g. contains > old tuple data after deleted tuples are pruned away. But when logging an > FPI, we omit that range. Which means that after crash recovery the area > is zeroed out. There's several cases where padding can result in the > same. > > Just look at checkXLogConsistency(), heap_mask() et al for all the > differences that can occur and that need to be ignored for the recovery > consistency checking to work. > > Particularly the hole issue seems trivial to exploit, because we know > the plaintext of the hole after crash recovery (0s). > > > I don't see how using the LSN alone is salvageable. OK, so you are saying the replica would have all zeros because of crash recovery, so XOR'ing that with the encryption stream makes the encryption stream visible, and you could use that to decrypt the dead data on the primary. That is an interesting case that we would need to fix. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
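The known-plaintext leak Bruce summarizes here can be sketched with a toy XOR stream cipher. All names and values below are invented, and a hashlib-based keystream stands in for AES-CTR; only the XOR property under discussion is illustrated: once the attacker knows the page hole decrypts to zeros on the crash-recovered replica, XOR'ing the replica's ciphertext with those zeros yields the keystream, which then decrypts the primary's copy of the same hole encrypted under the same nonce.

```python
import hashlib

def keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    # Toy CTR-style keystream: hash(key || nonce || counter) blocks stand in
    # for AES-CTR output; the XOR property demonstrated is the same.
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:length]

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

key = b"k" * 32
nonce = b"same-lsn-nonce"                # LSN-derived nonce, reused on both nodes

secret_hole = b"dead tuple data!"        # pruned tuples still present on the primary
primary_ct = xor(secret_hole, keystream(key, nonce, len(secret_hole)))

zero_hole = b"\x00" * len(secret_hole)   # hole is zeroed after crash recovery
replica_ct = xor(zero_hole, keystream(key, nonce, len(zero_hole)))

# Known zeros reveal the keystream (XOR with 0 is the identity)...
leaked_keystream = xor(replica_ct, zero_hole)
# ...which decrypts the primary's hole, encrypted under the same nonce.
recovered = xor(primary_ct, leaked_keystream)
assert recovered == secret_hole
```

Nothing here depends on the toy keystream being weak; any stream cipher has this property when a (key, nonce) pair is reused across two different plaintexts.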
Re: storing an explicit nonce
On Wed, May 26, 2021 at 05:11:24PM -0700, Andres Freund wrote: > Hi, > > On 2021-05-25 17:12:05 -0400, Bruce Momjian wrote: > > If we used a block cipher instead of a streaming one (CTR), this might > > not work because the earlier blocks can be based in the output of > > later blocks. > > What made us choose CTR for WAL & data file encryption? I checked the > README in the patchset and the wiki page, and neither seem to discuss > that. > > The dangers around nonce reuse, the space overhead of storing the nonce, > the fact that single bit changes in the encrypted data don't propagate > seem not great? Why aren't we using something like XTS? It has obvious > issues as wel, but CTR's weaknesses seem at least as great. And if we > want a MAC, then we don't want CTR either. We chose CTR because it was fast, and we could use the same method for WAL, which needs a streaming, not block, cipher. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Re: storing an explicit nonce
On Thu, May 27, 2021 at 05:45:21PM +0800, Neil Chen wrote: > Greetings, > > On Thu, May 27, 2021 at 4:52 PM Bruce Momjian wrote: > > > > > I am confused why checksums, which are widely used, acceptably require > > wal_log_hints, but there is concern that file encryption, which is > > heavier, cannot acceptably require wal_log_hints. I must be missing > > something. > > > > Why can't checksums also throw away hint bit changes like you want to do > > for file encryption and not require wal_log_hints? > > > > I'm really confused about it, too. I read the above communication, not sure if > my understanding is correct... What we are facing is not only the change of > flag such as *pd_flags*, but also others like pointer array changes in btree > like Robert said. We don't need them to write a WAL record. Well, the code now does write full page images for hint bit changes, so it should work fine. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Re: storing an explicit nonce
On Thu, May 27, 2021 at 10:47:13AM -0400, Robert Haas wrote: > On Wed, May 26, 2021 at 4:40 PM Bruce Momjian wrote: > > You are saying that by using a non-LSN nonce, you can write out the page > > with a new nonce, but the same LSN, and also discard the page during > > crash recovery and use the WAL copy? > > I don't know what "discard the page during crash recovery and use the > WAL copy" means. I was asking how decoupling the nonce from the LSN allows for us to avoid full page writes for hint bit changes. I am guessing you are saying that on recovery, if we see a hint-bit-only change in the WAL (with a new nonce), we just throw away the page because it could be torn and use the WAL full page write version. > > I am confused why checksums, which are widely used, acceptably require > > wal_log_hints, but there is concern that file encryption, which is > > heavier, cannot acceptably require wal_log_hints. I must be missing > > something. > > I explained this in the first complete paragraph of my first email > with this subject line: "For example, right now, we only need to WAL > log hints for the first write to each page after a checkpoint, but in > this approach, if the same page is written multiple times per > checkpoint cycle, we'd need to log hints every time." That's a huge > difference. Page eviction in some workloads can push the same pages > out of shared buffers every few seconds, whereas something that has to > be done once per checkpoint cycle cannot affect each page nearly so > often. A checkpoint is only going to occur every 5 minutes by default, > or more realistically every 10-15 minutes in a well-tuned production > system. In other words, we're not holding up some kind of double > standard, where the existing feature is allowed to depend on doing a > certain thing but your feature isn't allowed to depend on the same > thing. Your design depends on doing something which is potentially > 100x+ more expensive than the existing thing. 
It's not always going to > be that expensive, but it can be. Yes, it might be 1e100+++ more expensive too, but we don't know, and I am not ready to add a lot of complexity for such an unknown. > > Why can't checksums also throw away hint bit changes like you want to do > > for file encryption and not require wal_log_hints? > > Well, I don't want to throw away hint bit changes, just like we don't > throw them away right now. And I want to do that by making sure that > each time the page is written, we use a different nonce, but without > the expense of having to advance the LSN. > > Now, another option is to do what you suggest here. We could say that > if a dirty page is evicted, but the page is only dirty because of > hint-type changes, we don't actually write it out. That does avoid > using the same nonce for multiple writes, because now there's only one > write. It also fixes the problem on standbys that Andres was > complaining about, because on a standby, the only way a page can > possibly be dirtied without an associated WAL record is through a > hint-type change. However, I think we'd find that this, too, is pretty > expensive in certain workloads. It's useful to write hint bits - > that's why we do it. Oh, that does sound nice. It is kind of an exit hatch if we are evicting pages often for hint bit changes. I like it. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Re: Logical Replication - improve error message while adding tables to the publication in check_publication_add_relation
On Wed, May 26, 2021 at 7:55 PM Bharath Rupireddy wrote: > > On Wed, May 26, 2021 at 7:38 PM vignesh C wrote: > > > Attaching v5 patch, please have a look. > > > > We get the following error while adding an index: > > create publication mypub for table idx_t1; > > ERROR: "idx_t1" is an index > > > > This error occurs internally from table_openrv function call, if we > > could replace this with relation_openrv and then check the table kind, > > we could throw a similar error message here too like the other changes > > in the patch. > > Do you say that we replace table_open in publication_add_relation with > relation_open and have the "\"%s\" is an index" or "\"%s\" is a > composite type" checks in check_publication_add_relation? If that is > so, I don't think it's a good idea to have the extra code in > check_publication_add_relation and I would like it to be the way it is > currently. Before calling check_publication_add_relation, we will call OpenTableList to get the list of relations. In OpenTableList we don't include the errdetail for the failure like you have fixed it in check_publication_add_relation. When a user tries to add index objects or composite types, the error will be thrown earlier. I didn't mean to change check_publication_add_relation, I meant to change table_openrv to relation_openrv in OpenTableList and include error details in case of failure like the change attached. If you are OK, please include the change in your patch.
Regards, Vignesh

diff --git a/src/backend/commands/publicationcmds.c b/src/backend/commands/publicationcmds.c
index 95c253c8e0..b6380d7491 100644
--- a/src/backend/commands/publicationcmds.c
+++ b/src/backend/commands/publicationcmds.c
@@ -16,6 +16,7 @@
 #include "access/genam.h"
 #include "access/htup_details.h"
+#include "access/relation.h"
 #include "access/table.h"
 #include "access/xact.h"
 #include "catalog/catalog.h"
@@ -523,7 +524,21 @@ OpenTableList(List *tables)
 		/* Allow query cancel in case this takes a long time */
 		CHECK_FOR_INTERRUPTS();
 
-		rel = table_openrv(rv, ShareUpdateExclusiveLock);
+		rel = relation_openrv(rv, ShareUpdateExclusiveLock);
+		if (rel->rd_rel->relkind == RELKIND_INDEX ||
+			rel->rd_rel->relkind == RELKIND_PARTITIONED_INDEX)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("\"%s\" is an index",
+							RelationGetRelationName(rel)),
+					 errdetail("Index objects not supported by publications.")));
+		else if (rel->rd_rel->relkind == RELKIND_COMPOSITE_TYPE)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("\"%s\" is a composite type",
+							RelationGetRelationName(rel)),
+					 errdetail("Composite type not supported by publications.")));
+
 		myrelid = RelationGetRelid(rel);

 		/*
Re: storing an explicit nonce
Hi, On Thu, May 27, 2021, at 08:10, Bruce Momjian wrote: > On Wed, May 26, 2021 at 05:11:24PM -0700, Andres Freund wrote: > > Hi, > > > > On 2021-05-25 17:12:05 -0400, Bruce Momjian wrote: > > > If we used a block cipher instead of a streaming one (CTR), this might > > > not work because the earlier blocks can be based in the output of > > > later blocks. > > > > What made us choose CTR for WAL & data file encryption? I checked the > > README in the patchset and the wiki page, and neither seem to discuss > > that. > > > > The dangers around nonce reuse, the space overhead of storing the nonce, > > the fact that single bit changes in the encrypted data don't propagate > > seem not great? Why aren't we using something like XTS? It has obvious > > issues as wel, but CTR's weaknesses seem at least as great. And if we > > want a MAC, then we don't want CTR either. > > We chose CTR because it was fast, and we could use the same method for > WAL, which needs a streaming, not block, cipher. The WAL is block oriented too. Andres
Re: storing an explicit nonce
Greetings, * Andres Freund (and...@anarazel.de) wrote: > On Thu, May 27, 2021, at 08:10, Bruce Momjian wrote: > > On Wed, May 26, 2021 at 05:11:24PM -0700, Andres Freund wrote: > > > On 2021-05-25 17:12:05 -0400, Bruce Momjian wrote: > > > > If we used a block cipher instead of a streaming one (CTR), this might > > > > not work because the earlier blocks can be based in the output of > > > > later blocks. > > > > > > What made us choose CTR for WAL & data file encryption? I checked the > > > README in the patchset and the wiki page, and neither seem to discuss > > > that. > > > > > > The dangers around nonce reuse, the space overhead of storing the nonce, > > > the fact that single bit changes in the encrypted data don't propagate > > > seem not great? Why aren't we using something like XTS? It has obvious > > > issues as wel, but CTR's weaknesses seem at least as great. And if we > > > want a MAC, then we don't want CTR either. > > > > We chose CTR because it was fast, and we could use the same method for > > WAL, which needs a streaming, not block, cipher. > > The WAL is block oriented too. I'm curious what you'd suggest for the heap where we wouldn't be able to have block chaining (at least, I presume we aren't talking about rewriting entire segments whenever we change something in a heap). Thanks, Stephen signature.asc Description: PGP signature
Re: storing an explicit nonce
Hi, On 2021-05-27 11:49:33 -0400, Stephen Frost wrote: > * Andres Freund (and...@anarazel.de) wrote: > > On Thu, May 27, 2021, at 08:10, Bruce Momjian wrote: > > > On Wed, May 26, 2021 at 05:11:24PM -0700, Andres Freund wrote: > > > > On 2021-05-25 17:12:05 -0400, Bruce Momjian wrote: > > > > > If we used a block cipher instead of a streaming one (CTR), this might > > > > > not work because the earlier blocks can be based in the output of > > > > > later blocks. > > > > > > > > What made us choose CTR for WAL & data file encryption? I checked the > > > > README in the patchset and the wiki page, and neither seem to discuss > > > > that. > > > > > > > > The dangers around nonce reuse, the space overhead of storing the nonce, > > > > the fact that single bit changes in the encrypted data don't propagate > > > > seem not great? Why aren't we using something like XTS? It has obvious > > > > issues as wel, but CTR's weaknesses seem at least as great. And if we > > > > want a MAC, then we don't want CTR either. > > > > > > We chose CTR because it was fast, and we could use the same method for > > > WAL, which needs a streaming, not block, cipher. > > > > The WAL is block oriented too. > > I'm curious what you'd suggest for the heap where we wouldn't be able to > have block chaining (at least, I presume we aren't talking about > rewriting entire segments whenever we change something in a heap). What prevents us from using something like XTS? I'm not saying that that is the right approach, due to the fact that it leaks information about a block being the same as an earlier version of the same block. But right now we are talking about using CTR without addressing the weaknesses CTR has, where a failure to increase the nonce is fatal (the code even documents known cases where that could happen!), and where there's no error propagation within a block. Greetings, Andres Freund
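The CTR malleability Andres keeps pointing at (a small ciphertext edit shifting pd_upper without invalidating the rest of the page) follows from the fact that flipping ciphertext bits flips exactly the same plaintext bits, with no error propagation. A minimal sketch, with an invented keystream standing in for AES-CTR output that the attacker never sees:

```python
def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

# A page-header-like plaintext: pretend bytes 0-1 hold pd_upper = 0x0150.
plaintext = (0x0150).to_bytes(2, "little") + b"rest of the page"
ks = bytes(range(len(plaintext)))       # stand-in keystream, unknown to the attacker
ciphertext = xor(plaintext, ks)

# Attacker XORs the first two ciphertext bytes with a chosen delta,
# never touching (or needing) the key or keystream:
delta = (0x0150 ^ 0x0140).to_bytes(2, "little")
tampered = xor(ciphertext[:2], delta) + ciphertext[2:]

decrypted = xor(tampered, ks)
assert int.from_bytes(decrypted[:2], "little") == 0x0140  # pd_upper shifted
assert decrypted[2:] == b"rest of the page"               # nothing else disturbed
```

A wide-block or chained mode would instead turn the same edit into a garbled block, making the tamper far more likely to be caught as corruption.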
Re: storing an explicit nonce
On Thu, May 27, 2021 at 11:19 AM Bruce Momjian wrote: > On Thu, May 27, 2021 at 10:47:13AM -0400, Robert Haas wrote: > > On Wed, May 26, 2021 at 4:40 PM Bruce Momjian wrote: > > > You are saying that by using a non-LSN nonce, you can write out the page > > > with a new nonce, but the same LSN, and also discard the page during > > > crash recovery and use the WAL copy? > > > > I don't know what "discard the page during crash recovery and use the > > WAL copy" means. > > I was asking how decoupling the nonce from the LSN allows for us to > avoid full page writes for hint bit changes. I am guessing you are > saying that on recovery, if we see a hint-bit-only change in the WAL > (with a new nonce), we just throw away the page because it could be torn > and use the WAL full page write version. Well, in the design where the nonce is stored in the page, there is no need for every hint-type change to appear in the WAL at all. Once per checkpoint cycle, you need to write a full page image, as we do for checksums or wal_log_hints. The rest of the time, you can just bump the nonce and rewrite the page, same as we do today. > Yes, it might be 1e100+++ more expensive too, but we don't know, and I > am not ready to add a lot of complexity for such an unknown. No, it can't be 1e100+++ more expensive, because it's not realistically possible for a page to be written to disk 1e100+++ times per checkpoint cycle. It is however entirely possible for it to be written 100 times per checkpoint cycle. That is not something unknown about which we need to speculate; it is easy to see that this can happen, even on a simple test like pgbench with a data set larger than shared buffers. It is not right to confuse "we have no idea whether this will be expensive" with "how expensive this will be is workload-dependent," which is what you seem to be doing here. 
If we had no idea whether something would be expensive, then I agree that it might not be worth adding complexity for it, or maybe some testing should be done first to find out. But if we know for certain that in some workloads something can be very expensive, then we had better at least talk about whether it is worth adding complexity in order to resolve the problem. And that is the situation here. I am not even convinced that storing the nonce in the block is going to be more complex, because it seems to me that the patches I posted upthread worked out pretty cleanly. There are some things to discuss and think about there, for sure, but it is not like we are talking about inventing warp drive. -- Robert Haas EDB: http://www.enterprisedb.com
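The nonce-in-page design Robert describes can be sketched as a toy model (all names are invented; this is not the patch's actual code): every physical write bumps a per-page counter used as the IV, while the LSN moves only when a WAL record is actually emitted.

```python
# Toy model of "store the nonce in the page": hint-type changes dirty the
# page without a WAL record, so the LSN stays put, but each write to disk
# still gets a fresh IV by bumping a per-page counter. Names are invented.
class Page:
    def __init__(self):
        self.lsn = 0x1000
        self.nonce = 0
        self.hint_bits = 0

def write_page(page: Page) -> tuple:
    page.nonce += 1          # fresh IV for every physical write
    return (page.lsn, page.nonce)

p = Page()
p.hint_bits |= 1             # hint-type change: no WAL record, LSN unchanged
first = write_page(p)
p.hint_bits |= 2             # another hint-type change before the next eviction
second = write_page(p)

assert first[0] == second[0]     # same LSN on both writes...
assert first[1] != second[1]     # ...but a different nonce each time
```

With the LSN itself as the nonce, the only way to get `first[1] != second[1]` would be to advance the LSN, i.e. emit WAL for every such write, which is the cost being argued about.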
Re: storing an explicit nonce
Hi, On 2021-05-27 12:01:16 -0400, Bruce Momjian wrote: > On Thu, May 27, 2021 at 08:34:51AM -0700, Andres Freund wrote: > > On Thu, May 27, 2021, at 08:10, Bruce Momjian wrote: > > > On Wed, May 26, 2021 at 05:11:24PM -0700, Andres Freund wrote: > > > > On 2021-05-25 17:12:05 -0400, Bruce Momjian wrote: > > > > > If we used a block cipher instead of a streaming one (CTR), this might > > > > > not work because the earlier blocks can be based in the output of > > > > > later blocks. > > > > > > > > What made us choose CTR for WAL & data file encryption? I checked the > > > > README in the patchset and the wiki page, and neither seem to discuss > > > > that. > > > > > > > > The dangers around nonce reuse, the space overhead of storing the nonce, > > > > the fact that single bit changes in the encrypted data don't propagate > > > > seem not great? Why aren't we using something like XTS? It has obvious > > > > issues as wel, but CTR's weaknesses seem at least as great. And if we > > > > want a MAC, then we don't want CTR either. > > > > > > We chose CTR because it was fast, and we could use the same method for > > > WAL, which needs a streaming, not block, cipher. > > > > The WAL is block oriented too. > > Well, AES block mode only does 16-byte blocks, as far as I know, and I > assume WAL is more granular than that. WAL is 8kB blocks by default. We only ever write it out with at least that granularity. > Also, you need to know the bytes _before_ the WAL do write a new > 16-byte block, so it seems overly complex for our usage too. See the XTS reference. Yes, it needs the previous 16bytes, but only within the 8kB page. Greetings, Andres Freund
Re: storing an explicit nonce
On Thu, May 27, 2021 at 12:03:00PM -0400, Robert Haas wrote: > On Thu, May 27, 2021 at 11:19 AM Bruce Momjian wrote: > > I was asking how decoupling the nonce from the LSN allows for us to > > avoid full page writes for hint bit changes. I am guessing you are > > saying that on recovery, if we see a hint-bit-only change in the WAL > > (with a new nonce), we just throw away the page because it could be torn > > and use the WAL full page write version. > > Well, in the design where the nonce is stored in the page, there is no > need for every hint-type change to appear in the WAL at all. Once per > checkpoint cycle, you need to write a full page image, as we do for > checksums or wal_log_hints. The rest of the time, you can just bump > the nonce and rewrite the page, same as we do today. What is it about having the nonce be the LSN that doesn't allow that to happen? Could we just create a dummy LSN record and assign that to the page and use that as a nonce? > > Yes, it might be 1e100+++ more expensive too, but we don't know, and I > am not ready to add a lot of complexity for such an unknown. > > No, it can't be 1e100+++ more expensive, because it's not > realistically possible for a page to be written to disk 1e100+++ times > per checkpoint cycle. It is however entirely possible for it to be > written 100 times per checkpoint cycle. That is not something unknown > about which we need to speculate; it is easy to see that this can > happen, even on a simple test like pgbench with a data set larger than > shared buffers. I guess you didn't get my joke on that one. ;-) > It is not right to confuse "we have no idea whether this will be > expensive" with "how expensive this will be is workload-dependent," > which is what you seem to be doing here. If we had no idea whether > something would be expensive, then I agree that it might not be worth > adding complexity for it, or maybe some testing should be done first > to find out.
But if we know for certain that in some workloads > something can be very expensive, then we had better at least talk > about whether it is worth adding complexity in order to resolve the > problem. And that is the situation here. Sure, but the downsides of avoiding it seem very high to me, not only in code complexity but in requiring dump/reload or logical replication to deploy. > I am not even convinced that storing the nonce in the block is going > to be more complex, because it seems to me that the patches I posted > upthread worked out pretty cleanly. There are some things to discuss > and think about there, for sure, but it is not like we are talking > about inventing warp drive. See above. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Re: storing an explicit nonce
Hi, On 2021-05-27 11:10:00 -0400, Bruce Momjian wrote: > On Wed, May 26, 2021 at 04:46:29PM -0700, Andres Freund wrote: > > On 2021-05-25 22:23:46 -0400, Stephen Frost wrote: > > > Andres mentioned other possible cases where the LSN doesn’t change even > > > though we change the page and, as he’s probably right, we would have to > > > figure out a solution in those cases too (potentially including cases like > > > crash recovery or replay on a replica where we can’t really just go around > > > creating dummy WAL records to get new LSNs..). > > > > Yea, I think there's quite a few of those. For one, we don't guarantee > > that that the hole between pd_lower/upper is zeroes. It e.g. contains > > old tuple data after deleted tuples are pruned away. But when logging an > > FPI, we omit that range. Which means that after crash recovery the area > > is zeroed out. There's several cases where padding can result in the > > same. > > > > Just look at checkXLogConsistency(), heap_mask() et al for all the > > differences that can occur and that need to be ignored for the recovery > > consistency checking to work. > > > > Particularly the hole issue seems trivial to exploit, because we know > > the plaintext of the hole after crash recovery (0s). > > > > > > I don't see how using the LSN alone is salvagable. > > OK, so you are saying the replica would have all zeros because of crash > recovery, so XOR'ing that with the encryption steam makes the encryption > stream visible, and you could use that to decrypt the dead data on the > primary. That is an interesting case that would need to fix. I don't see how it's a viable security model to assume that you can ensure that we never write different data with the same LSN. Yes, you can fix a few cases, but how can we be confident that we're actually doing a good job, when the consequences are pretty dramatic. 
Nor do I think it's architecturally OK to impose a significant new hurdle against doing any sort of "changing" writes on standbys. It's time to move on from the idea of using the LSN as the nonce. Greetings, Andres Freund
Re: storing an explicit nonce
On Wed, May 26, 2021 at 05:02:01PM -0400, Bruce Momjian wrote: > Rather than surprise anyone, I might as well just come out and say some > things. First, I have always admitted this feature has limited > usefulness. > > I think a non-LSN nonce adds a lot of code complexity, which adds a code > and maintenance burden. It also prevents the creation of an encrypted > replica from a non-encrypted primary using binary replication, which > makes deployment harder. > > Take a feature of limited usefulness, add code complexity and deployment > difficulty, and the feature becomes even less useful. > > For these reasons, if we decide to go in the direction of using a > non-LSN nonce, I no longer plan to continue working on this feature. I > would rather work on things that have a more positive impact. Maybe a > non-LSN nonce is a better long-term plan, but there are too many > unknowns and complexity for me to feel comfortable with it. I had some more time to think about this. The big struggle for this feature has not been writing it, but rather keeping it lean enough that its code complexity will be acceptable for a feature of limited usefulness. (The Windows port and pg_upgrade took similar approaches.) Thinking about the feature to add checksums online, it seems to have failed due to us overcomplicating the feature. If we had not accepted the checksum restart requirement, the patch would probably be part of Postgres today. However, a few people asked for restartability, and since we don't really have much infrastructure to do online whole-cluster changes, it added a lot of code. Once the patch was done, we looked at the code size and the benefits of the feature, and decided it wasn't worth it. I suspect that if we start adding a non-LSN nonce and malicious write detection, we will end up with the same problem --- a complex patch for a feature that has limited usefulness, and requires dump/restore or logical replication to add it to a cluster.
I think such a patch would be rejected, and I would probably even vote against it myself. I don't want this to sound like I only want to do this my way, but I also don't want to be silent when I smell failure, and if the probability of failure gets too high, I am willing to abandon a feature rather than continue. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Re: storing an explicit nonce
On Thu, May 27, 2021 at 08:34:51AM -0700, Andres Freund wrote: > Hi, > > On Thu, May 27, 2021, at 08:10, Bruce Momjian wrote: > > On Wed, May 26, 2021 at 05:11:24PM -0700, Andres Freund wrote: > > > Hi, > > > > > > On 2021-05-25 17:12:05 -0400, Bruce Momjian wrote: > > > > If we used a block cipher instead of a streaming one (CTR), this might > > > > not work because the earlier blocks can be based in the output of > > > > later blocks. > > > > > > What made us choose CTR for WAL & data file encryption? I checked the > > > README in the patchset and the wiki page, and neither seem to discuss > > > that. > > > > > > The dangers around nonce reuse, the space overhead of storing the nonce, > > > the fact that single bit changes in the encrypted data don't propagate > > > seem not great? Why aren't we using something like XTS? It has obvious > > > issues as wel, but CTR's weaknesses seem at least as great. And if we > > > want a MAC, then we don't want CTR either. > > > > We chose CTR because it was fast, and we could use the same method for > > WAL, which needs a streaming, not block, cipher. > > The WAL is block oriented too. Well, AES block mode only does 16-byte blocks, as far as I know, and I assume WAL is more granular than that. Also, you need to know the bytes _before_ the WAL to write a new 16-byte block, so it seems overly complex for our usage too. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Re: storing an explicit nonce
Hi, On 2021-05-27 10:57:24 -0400, Bruce Momjian wrote: > On Wed, May 26, 2021 at 04:26:01PM -0700, Andres Freund wrote: > > I suspect that if we try to not disclose data if an attacker has write > > access, this still leaves us with issues around nonce reuse, unless we > > also employ integrity measures. Particularly due to CTR mode, which > > makes it easy to manipulate individual parts of the encrypted page > > without causing the decrypted page to be invalid. E.g. the attacker can > > just update pd_upper on the page by a small offset, and suddenly the > > replay will insert the tuple at a slightly shifted offset - which then > > seems to leak enough data to actually analyze things? > > Yes, I don't think protecting from write access is a realistic goal at > this point, and frankly ever. I think write access protection needs > all-cluster-file encryption. This is documented: > > > https://github.com/postgres/postgres/compare/master..bmomjian:_cfe-01-doc.patch > > Cluster file encryption does not protect against unauthorized > file system writes. Such writes can allow data decryption if > used to weaken the system's security and the weakened system is > later supplied with the externally-stored cluster encryption key. > This also does not always detect if users with write access remove > or modify database files. > > If this needs more text, let me know. Well, it's one thing to say that it's not a complete protection, and another that a few byte sized writes to a single page are sufficient to get access to encrypted data. And "all-cluster-file" encryption won't help against the type of scenario I outlined. > > https://github.com/bmomjian/postgres/commit/7b43d37a5edb91c29ab6b4bb00def05def502c33#diff-0dcb5b2f36c573e2a7787994690b8fe585001591105f78e58ae3accec8f998e0R92 > > /* > > * Check if the page has a special size == GISTPageOpaqueData, a valid > > * GIST_PAGE_ID, no invalid GiST flag bits are set, and a valid LSN. 
> > This > > * is true for all GiST pages, and perhaps a few pages that are not. > > The > > * only downside of guessing wrong is that we might not update the LSN > > for > > * some non-permanent relation page changes, and therefore reuse the IV, > > * which seems acceptable. > > */ > > > > Huh? > > Are you asking about this C comment in relation to the discussion > above, or is it an independent question? Are you asking what it means? The comment is blithely waving away a fundamental no-no (reusing nonces) when using CTR mode as "acceptable". Greetings, Andres Freund
Re: storing an explicit nonce
On Thu, May 27, 2021 at 12:01 PM Andres Freund wrote: > What prevents us from using something like XTS? I'm not saying that that > is the right approach, due to the fact that it leaks information about a > block being the same as an earlier version of the same block. But right > now we are talking about using CTR without addressing the weaknesses CTR > has, where a failure to increase the nonce is fatal (the code even > documents known cases where that could happen!), and where there's no > error propagation within a block. I spent some time this morning reading up on XTS in general and also on previous discussions on this list. It seems like XTS is considered state-of-the-art for full disk encryption, and what we're doing seems to me to be similar in concept. The most useful on-list discussion that I found was on this thread: https://www.postgresql.org/message-id/flat/c878de71-a0c3-96b2-3e11-9ac2c35357c3%40joeconway.com#19d3b7c37b9f84798f899360393584df There are a lot of things that people said on that thread, but then Bruce basically proposes CBC and/or CTR and I couldn't clearly understand the reasons for that choice. Maybe there was some off-list discussion of this that wasn't captured in the email traffic? All that having been said, I am pretty sure I don't fully understand what any of these modes involve. I gather that XTS requires two keys, but it seems like it doesn't require a nonce. It seems to use a "tweak" that is generated from the block number and the position within the block (since an e.g. 8kB database block is being encrypted as a bunch of 16-byte AES blocks) but apparently there's no problem with the tweak being the same every time the block is encrypted? If no nonce is required, that seems like a massive advantage, since then we don't need to worry about how to get one or about how to make sure it's never reused. -- Robert Haas EDB: http://www.enterprisedb.com
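[Editor's note: the tweak scheme discussed above can be sketched roughly as follows. This is a toy model, not real cryptography: a hash-based Feistel permutation stands in for the AES block cipher, and the per-block tweak is derived by a keyed hash of the page number and the 16-byte block index, where real XTS instead encrypts the sector number once and multiplies by alpha^j in GF(2^128). Only the structure is illustrated.]

```python
import hashlib

def prf(key: bytes, data: bytes) -> bytes:
    # Keyed pseudorandom function used for both Feistel rounds and tweaks.
    return hashlib.sha256(key + data).digest()

def toy_block_encrypt(key: bytes, block: bytes) -> bytes:
    # 4-round Feistel network over a 16-byte block: a stand-in for one
    # AES block-cipher invocation -- NOT secure, illustration only.
    l, r = block[:8], block[8:]
    for rnd in range(4):
        l, r = r, bytes(a ^ b for a, b in zip(l, prf(key, bytes([rnd]) + r)))
    return l + r

def xor16(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def xts_like_encrypt(key1: bytes, key2: bytes, page_no: int, page: bytes) -> bytes:
    # XTS structure: C_j = E_k1(P_j XOR T_j) XOR T_j, where the tweak T_j
    # depends only on the page (sector) number and block index j -- no nonce.
    out = []
    for j in range(len(page) // 16):
        tweak = prf(key2, page_no.to_bytes(8, "big") + j.to_bytes(8, "big"))[:16]
        p = page[16 * j:16 * (j + 1)]
        out.append(xor16(toy_block_encrypt(key1, xor16(p, tweak)), tweak))
    return b"".join(out)
```

Because the tweak is deterministic, encrypting the same page contents twice yields the same ciphertext (the "identical block version" leak Andres mentions below), while identical 16-byte plaintext blocks at different positions still encrypt differently because their tweaks differ.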
Re: storing an explicit nonce
Hi, On 2021-05-27 10:47:13 -0400, Robert Haas wrote: > Now, another option is to do what you suggest here. We could say that > if a dirty page is evicted, but the page is only dirty because of > hint-type changes, we don't actually write it out. That does avoid > using the same nonce for multiple writes, because now there's only one > write. It also fixes the problem on standbys that Andres was > complaining about, because on a standby, the only way a page can > possibly be dirtied without an associated WAL record is through a > hint-type change. What does that protect against that I was concerned about? That still allows hint bits to be leaked, via 1) replay WAL record with FPI 2) hint bit change during read 3) incremental page change vs 1) 3). Even if we declare that OK, it doesn't actually address the whole issue of WAL replay not necessarily re-creating bit identical page contents. Greetings, Andres Freund
Re: storing an explicit nonce
Hi, On 2021-05-27 12:28:39 -0400, Robert Haas wrote: > All that having been said, I am pretty sure I don't fully understand > what any of these modes involve. I gather that XTS requires two keys, > but it seems like it doesn't require a nonce. It needs a second secret, but that second secret can - as far as I understand it - be generated using a strong prng and encrypted with the "main" key, and stored in a central location. > It seems to use a "tweak" that is generated from the block number and > the position within the block (since an e.g. 8kB database block is > being encrypted as a bunch of 16-byte AES blocks) but apparently > there's no problem with the tweak being the same every time the block > is encrypted? Right. That comes with a price however: It leaks the information that a block "version" is identical to an earlier version of the block. That's obviously better than leaking information that allows decryption like with the nonce reuse issue. Nor does it provide integrity - which does seem like a significant issue going forward. Which does require storing additional per-page data... Greetings, Andres Freund
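[Editor's note: the second-secret handling Andres describes can be sketched as below. All names are hypothetical, and the XOR-with-keyed-hash "wrap" merely stands in for a real key-wrapping mode such as AES-KW, which an actual implementation would use.]

```python
import hashlib
import secrets

def wrap_key(main_key: bytes, secret: bytes) -> bytes:
    # Encrypt the generated secret under the main key so only the wrapped
    # form is stored centrally.  A keyed-hash XOR pad stands in for a real
    # key-wrap cipher here -- illustration only.
    pad = hashlib.sha256(main_key + b"xts-secondary-key-wrap").digest()
    return bytes(a ^ b for a, b in zip(secret, pad))

def unwrap_key(main_key: bytes, wrapped: bytes) -> bytes:
    # The XOR wrap is its own inverse.
    return wrap_key(main_key, wrapped)

main_key = secrets.token_bytes(32)
secondary = secrets.token_bytes(32)   # strong-PRNG second secret for XTS
stored = wrap_key(main_key, secondary)
```

The secondary key never needs to be unique per page or per write; it is generated once and recoverable only with the main key.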
Re: storing an explicit nonce
On Thu, May 27, 2021 at 12:28:39PM -0400, Robert Haas wrote: > On Thu, May 27, 2021 at 12:01 PM Andres Freund wrote: > > What prevents us from using something like XTS? I'm not saying that that > > is the right approach, due to the fact that it leaks information about a > > block being the same as an earlier version of the same block. But right > > now we are talking about using CTR without addressing the weaknesses CTR > > has, where a failure to increase the nonce is fatal (the code even > > documents known cases where that could happen!), and where there's no > > error propagation within a block. > > I spent some time this morning reading up on XTS in general and also > on previous discussions on this list. It seems like XTS is > considered state-of-the-art for full disk encryption, and what we're > doing seems to me to be similar in concept. The most useful on-list > discussion that I found was on this thread: > > https://www.postgresql.org/message-id/flat/c878de71-a0c3-96b2-3e11-9ac2c35357c3%40joeconway.com#19d3b7c37b9f84798f899360393584df > > There are a lot of things that people said on that thread, but then > Bruce basically proposes CBC and/or CTR and I couldn't clearly > understand the reasons for that choice. Maybe there was some off-list > discussion of this that wasn't captured in the email traffic? There was no other discussion about XTS that I know of. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Re: storing an explicit nonce
On Thu, May 27, 2021 at 12:31 PM Andres Freund wrote: > What does that protect against that I was concerned about? That still > allows hint bits to be leaked, via > > 1) replay WAL record with FPI > 2) hint bit change during read > 3) incremental page change > > vs 1) 3). Even if we declare that OK, it doesn't actually address the > whole issue of WAL replay not necessarily re-creating bit identical page > contents. You're right. That seems fatal, as it would lead to encrypting the different versions of the page with the same IV on the master and the standby, and the differences would consist of old data that could be recovered by XORing the two encrypted page versions. To be clear, it is tuple data that would be recovered, not just hint bits. -- Robert Haas EDB: http://www.enterprisedb.com
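[Editor's note: the XOR recovery described above is mechanical. A stdlib-only sketch, with a keyed hash standing in for the AES keystream generator; the algebra is identical for real AES-CTR, since ciphertext is always plaintext XOR keystream.]

```python
import hashlib

def ctr_keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    # CTR mode: keystream block i = E_key(nonce || i).  A keyed hash
    # stands in for AES here -- the reuse property is the same.
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:length]

def ctr_encrypt(key: bytes, nonce: bytes, plaintext: bytes) -> bytes:
    ks = ctr_keystream(key, nonce, len(plaintext))
    return bytes(p ^ k for p, k in zip(plaintext, ks))

# Two versions of the "same page" encrypted under a reused nonce:
key, nonce = b"0123456789abcdef", b"reused-nonce"
v1 = b"tuple data ... hint bit = 0".ljust(48, b"\x00")
v2 = b"tuple data ... hint bit = 1".ljust(48, b"\x00")
c1, c2 = ctr_encrypt(key, nonce, v1), ctr_encrypt(key, nonce, v2)

# The keystream cancels out: C1 XOR C2 == P1 XOR P2, exposing exactly
# which bytes changed between the two page versions.
leak = bytes(a ^ b for a, b in zip(c1, c2))
```

No key material is needed for this attack; write access plus two ciphertexts under one nonce is enough.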
Re: storing an explicit nonce
Greetings, * Andres Freund (and...@anarazel.de) wrote: > On 2021-05-27 12:28:39 -0400, Robert Haas wrote: > > All that having been said, I am pretty sure I don't fully understand > > what any of these modes involve. I gather that XTS requires two keys, > > but it seems like it doesn't require a nonce. > > It needs a second secret, but that second secret can - as far as I > understand it - be generated using a strong prng and encrypted with the > "main" key, and stored in a central location. Yes, I'm fairly confident this is the case. > > It seems to use a "tweak" that is generated from the block number and > > the position within the block (since an e.g. 8kB database block is > > being encrypted as a bunch of 16-byte AES blocks) but apparently > > there's no problem with the tweak being the same every time the block > > is encrypted? > > Right. That comes with a price however: It leaks the information that a > block "version" is identical to an earlier version of the block. That's > obviously better than leaking information that allows decryption like > with the nonce reuse issue. Right, if we simply can't solve the nonce-reuse concern then that would be better. > Nor does it provide integrity - which does seem like a significant issue > going forward. Which does require storing additional per-page data... Yeah, this is one of the reasons that I hadn't really been thrilled with XTS- I've really been looking down the road at eventually having GCM and having actual integrity validation included. That's not really a reason to rule it out though and Bruce's point about having a way to get to an encrypted cluster from an unencrypted one is certainly worth consideration. 
Naturally, we'd need to document everything appropriately but there isn't anything saying that we couldn't, say, have XTS in v15 without any adjustments to the page layout, accepting that there's no data integrity validation and focusing just on encryption, and then returning to the question about adding in data integrity validation for a future version, perhaps using the special area for a nonce+tag with GCM or maybe something else. Users who wish to move to a cluster with encryption and data integrity validation would have to get there through some other means than replication, but that's going to always be the case because we have to have space to store the tag, even if we can figure out some other solution for the nonce. Thanks, Stephen signature.asc Description: PGP signature
Re: Logical Replication - improve error message while adding tables to the publication in check_publication_add_relation
On Thu, May 27, 2021 at 9:02 PM vignesh C wrote: > > Do you say that we replace table_open in publication_add_relation with > > relation_open and have the "\"%s\" is an index" or "\"%s\" is a > > composite type" checks in check_publication_add_relation? If that is > > so, I don't think it's a good idea to have the extra code in > > check_publication_add_relation and I would like it to be the way it is > > currently. > > Before calling check_publication_add_relation, we will call > OpenTableList to get the list of relations. In OpenTableList we don't > include the errdetail for the failure like you have fixed it in > check_publication_add_relation. When a user tries to add index objects > or composite types, the error will be thrown earlier itself. I didn't > mean to change check_publication_add_relation, I meant to change > table_openrv to relation_openrv in OpenTableList and include error > details in case of failure like the change attached. If you are ok, > please include the change in your patch. I don't think we need to change that. The general intuition is that with CREATE PUBLICATION ... FOR TABLE/FOR ALL TABLES one can specify only tables, and if an index/composite type is specified, the error messages "\"%s\" is an index"/"\"%s\" is a composite type" can imply that they are not supported with CREATE PUBLICATION. There's no need for a detailed error message saying "Index/Composite Type cannot be added to publications.". Whereas foreign/unlogged/temporary/system tables are actually tables, and we don't support them. So a detailed error message here can state that explicitly. I'm not taking the patch; attaching v5 again here to make cfbot happy and for further review. BTW, when we use relation_openrv, we have to use relation_close. With Regards, Bharath Rupireddy. 
EnterpriseDB: http://www.enterprisedb.com

From c4c4efa96028fb9fe0b88e91c064a35484c6d8f0 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy
Date: Mon, 5 Apr 2021 19:00:24 +0530
Subject: [PATCH v5] Improve publication error messages

Adding a foreign table into a publication prints an error saying
"foo is not a table". Although a foreign table is not a regular table,
this message could possibly confuse users. Provide a suitable error
message according to the object class (table vs foreign table). While
at it, separate the unlogged/temp table error message into 2 messages.
---
 .../postgres_fdw/expected/postgres_fdw.out |  6 ++
 contrib/postgres_fdw/sql/postgres_fdw.sql  |  5 +
 src/backend/catalog/pg_publication.c       | 20 ---
 src/test/regress/expected/publication.out  | 16 +++
 src/test/regress/sql/publication.sql       | 14 +
 5 files changed, 58 insertions(+), 3 deletions(-)

diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 59e4e27ffb..b14fd8618c 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -9933,3 +9933,9 @@ DROP TABLE base_tbl4;
 DROP TABLE join_tbl;
 ALTER SERVER loopback OPTIONS (DROP async_capable);
 ALTER SERVER loopback2 OPTIONS (DROP async_capable);
+-- ===
+-- test error case for create publication on foreign table
+-- ===
+CREATE PUBLICATION testpub_foreign_tbl FOR TABLE ft1; --ERROR
+ERROR: "ft1" is a foreign table
+DETAIL: Foreign tables cannot be added to publications.
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index 107d1c0e03..561c9af23a 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -3138,3 +3138,8 @@
 DROP TABLE join_tbl;
 ALTER SERVER loopback OPTIONS (DROP async_capable);
 ALTER SERVER loopback2 OPTIONS (DROP async_capable);
+
+-- ===
+-- test error case for create publication on foreign table
+-- ===
+CREATE PUBLICATION testpub_foreign_tbl FOR TABLE ft1; --ERROR
diff --git a/src/backend/catalog/pg_publication.c b/src/backend/catalog/pg_publication.c
index 86e415af89..592d2d79e1 100644
--- a/src/backend/catalog/pg_publication.c
+++ b/src/backend/catalog/pg_publication.c
@@ -49,6 +49,14 @@
 static void
 check_publication_add_relation(Relation targetrel)
 {
+	/* FOREIGN table cannot be part of publication. */
+	if (RelationGetForm(targetrel)->relkind == RELKIND_FOREIGN_TABLE)
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+				 errmsg("\"%s\" is a foreign table",
+						RelationGetRelationName(targetrel)),
+				 errdetail("Foreign tables cannot be added to publications.")));
+
 	/* Must be a regular or partitioned table */
 	if (RelationGetForm(targetrel)->relkind != RELKIND_RELATION &&
 		RelationGetForm(targetrel)->relkind != RE
Re: Incorrect snapshots while promoting hot standby node when 2PC is used
Hi, On 2021-05-26 16:57:31 +0900, Michael Paquier wrote: > Yes, there should not be any as far as I recall. 2PC is kind of > special with its fake ProcArray entries. It's really quite an awful design :( > > I think to fix the issue we'd have to move > > ShutdownRecoveryTransactionEnvironment() to after > > XLogCtl->SharedRecoveryState > > = RECOVERY_STATE_DONE. > > > > The acquisition of ProcArrayLock() in > > ShutdownRecoveryTransactionEnvironment()->ExpireAllKnownAssignedTransactionIds() > > should prevent the data from being removed between the RecoveryInProgress() > > and the KnownAssignedXidsGetAndSetXmin() calls in GetSnapshotData(). > > > > I haven't yet figured out whether there would be a problem with deferring > > the > > other tasks in ShutdownRecoveryTransactionEnvironment() until after > > RECOVERY_STATE_DONE. > > Hmm. This would mean releasing all the exclusive locks tracked by a > standby, as of StandbyReleaseAllLocks(), after opening the instance > for writes after a promotion. I don't think that's unsafe, but it > would be intrusive. Why would it be intrusive? We're talking a split second here, no? More importantly, I don't think it's correct to release the locks at that point. > Anyway, isn't the issue ExpireAllKnownAssignedTransactionIds() itself, > where we should try to not wipe out the 2PC entries to make sure that > all those snapshots still see the 2PC transactions as something to > count on? I am attaching a crude patch to show the idea. I don't think that's sufficient. We can't do most of the other stuff in ShutdownRecoveryTransactionEnvironment() before changing XLogCtl->SharedRecoveryState either. As long as the other backends think we are in recovery, we shouldn't release e.g. the virtual transaction. Greetings, Andres Freund
Re: CREATE COLLATION - check for duplicate options and error out if found one
On Thu, May 27, 2021 at 8:36 PM vignesh C wrote: > > On Wed, May 26, 2021 at 7:44 PM Bharath Rupireddy > wrote: > > > > On Wed, May 26, 2021 at 7:18 PM vignesh C wrote: > > > +1 for fixing this issue, we have handled this error in other places. > > > The patch does not apply on head, could you rebase the patch on head > > > and post a new patch. > > > > Thanks. I thought of rebasing once the other patch (which reorganizes > > "...specified more than once" error) gets committed. Anyways, I've > > rebased for now on the latest master. Please review v2 patch. > > The test changes look good to me, I liked the idea of rebasing the > source changes once "specified more than once" patch gets committed. I > will review the code changes after that. I'm not sure which patch will go in first. I think both reviews can be finished, and once we see which one the committer picks up first, the other patch can be rebased on top of it. With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
Re: storing an explicit nonce
Hi, On 2021-05-27 12:49:15 -0400, Stephen Frost wrote: > That's not really a reason to rule it out though and Bruce's point about > having a way to get to an encrypted cluster from an unencrypted one is > certainly worth consideration. Naturally, we'd need to document > everything appropriately but there isn't anything saying that we > couldn't, say, have XTS in v15 without any adjustments to the page > layout, accepting that there's no data integrity validation and focusing > just on encryption, and then returning to the question about adding in > data integrity validation for a future version, perhaps using the > special area for a nonce+tag with GCM or maybe something else. Users > who wish to move to a cluster with encryption and data integrity > validation would have to get there through some other means than > replication, but that's going to always be the case because we have to > have space to store the tag, even if we can figure out some other > solution for the nonce. But won't we then end up with a different set of requirements around nonce assignment durability when introducing GCM support? That's not actually entirely trivial to do correctly on a standby. I guess we can use AES-GCM-SIV and be ok with living with edge cases leading to nonce reuse, but ... Greetings, Andres Freund
Re: Reducing the range of OIDs consumed by genbki.pl
On Wed, May 26, 2021 at 5:43 PM Tom Lane wrote: > > The attached patch stems from the conversation at [1]; > I'm starting a new thread to avoid confusing the cfbot. The patch looks good to me. -- John Naylor EDB: http://www.enterprisedb.com
Re: storing an explicit nonce
Greetings, * Andres Freund (and...@anarazel.de) wrote: > On 2021-05-27 12:49:15 -0400, Stephen Frost wrote: > > That's not really a reason to rule it out though and Bruce's point about > > having a way to get to an encrypted cluster from an unencrypted one is > > certainly worth consideration. Naturally, we'd need to document > > everything appropriately but there isn't anything saying that we > > couldn't, say, have XTS in v15 without any adjustments to the page > > layout, accepting that there's no data integrity validation and focusing > > just on encryption, and then returning to the question about adding in > > data integrity validation for a future version, perhaps using the > > special area for a nonce+tag with GCM or maybe something else. Users > > who wish to move to a cluster with encryption and data integrity > > validation would have to get there through some other means than > > replication, but that's going to always be the case because we have to > > have space to store the tag, even if we can figure out some other > > solution for the nonce. > > But won't we then end up with a different set of requirements around > nonce assignment durability when introducing GCM support? That's not > actually entirely trivial to do correctly on a standby. I guess we can > use AES-GCM-SSIV and be ok with living with edge cases leading to nonce > reuse, but ... Not sure if I'm entirely following the question but I would have thought the up-thread idea of generating a random part of the nonce for each start up and then a global counter for the rest, which would be written whenever the page is updated (meaning it wouldn't have anything to do with the LSN and would be stored in the special area as Robert contemplated) would work for both primaries and replicas. 
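[Editor's note: the random-prefix-plus-counter scheme described above can be sketched as below. The structure and names are hypothetical, purely for illustration of the idea: a 96-bit GCM-style nonce whose first bytes are fixed at startup and whose remainder is a shared monotonically increasing counter, so no nonce repeats within one server lifetime.]

```python
import secrets
import threading

class PageNonceGenerator:
    """96-bit nonce = 4 random bytes chosen at startup + 64-bit counter.

    Repeats are impossible within one lifetime; a collision across
    restarts (or across primary and standby) additionally requires the
    independently chosen random prefixes to coincide."""

    def __init__(self) -> None:
        self._prefix = secrets.token_bytes(4)   # regenerated at each startup
        self._counter = 0
        self._lock = threading.Lock()           # one shared counter

    def next_nonce(self) -> bytes:
        with self._lock:
            self._counter += 1
            return self._prefix + self._counter.to_bytes(8, "big")
```

The counter has no relationship to the LSN; it would live in the page's special area as contemplated elsewhere in the thread.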
Taking a step back, while I like the idea of trying to think through these complications in a future world where we add GCM support, if we're actually agreed on seriously looking at XTS for v15 then maybe we should focus on that for the moment. As Bruce says, there are a lot of moving parts in this patch that likely need discussion and agreement in order for us to be able to move forward with it. For starters, we'd probably want to get agreement on what we'd use to construct the tweak. Thanks, Stephen signature.asc Description: PGP signature
Re: Reducing the range of OIDs consumed by genbki.pl
John Naylor writes: > The patch looks good to me. Thanks for reviewing! regards, tom lane
Re: storing an explicit nonce
On Thu, May 27, 2021 at 12:15 PM Bruce Momjian wrote: > > Well, in the design where the nonce is stored in the page, there is no > > need for every hint-type change to appear in the WAL at all. Once per > > checkpoint cycle, you need to write a full page image, as we do for > > checksums or wal_log_hints. The rest of the time, you can just bump > > the nonce and rewrite the page, same as we do today. > > What is it about having the nonce be the LSN that doesn't allow that to > happen? Could we just create a dummy LSN record and assign that to the > page and use that as a nonce. I can't tell which of two possible proposals you are describing here. If the LSN is used to derive the nonce, then one option is to just log a WAL record every time we need a new nonce. As I understand it, that's basically what you've already implemented, and we've discussed the disadvantages of that approach at some length already. The basic problems seem to be: - It's potentially very expensive if page evictions are frequent, which they will be whenever the workload is write-heavy and the working set is larger than shared_buffers. - If there's ever a situation where we need to write a page image different from any page image written previously and we cannot at that time write a WAL record to generate a new LSN for use as the nonce, then the algorithm is broken entirely. Andres's latest post points out - I think correctly - that this happens on standbys, because WAL replay does not generate byte-identical results on standbys even if you ignore hint bits. The first point strikes me as a sufficiently serious performance problem to justify giving up on this design, but that's a judgement call. The second one seems like it breaks it entirely. Now, there's another possible direction that is also suggested by your remarks here: maybe you meant using a fake LSN in cases where we can't use a real one. 
For example, suppose you decide to reserve half of the LSN space - all LSNs with the high bit set, for example - for this purpose. Well, you somehow need to ensure that you never use one of those values more than once, so you might think of putting a counter in shared memory. But now imagine a master with two standbys. How would you avoid having the same counter value used on one standby and also on the other standby? Even if they use the same counter for different pages, it's a critical security flaw. And since those standbys don't even need to know that the other one exists, that seems pretty well impossible to avoid. Now you might ask why we don't have the same problem if we store the nonce in the special space. One difference is that if you store the nonce explicitly, you can allow however much bit space you need in order to guarantee uniqueness, whereas reserving half the LSN space only gives you 63 bits. That's not enough to achieve uniqueness without tight coordination. With 128 bits, you can do things like just generate random values and assume they're vanishingly unlikely to collide, or randomly generate half the value and use the other half as a counter and be pretty safe. With 63 bits you just don't have enough bit space available to reliably avoid collisions using algorithms of that type, due to the birthday paradox. I think it would be adequate for uniqueness if there were a single shared counter and every allocation came from it, but again, as soon as you imagine a master and a bunch of standbys, that breaks down. Also, it's not entirely clear to me that you can avoid needing the LSN space on the page for a real LSN at the same time you also need it for a fake-LSN-being-used-as-a-nonce. We rely on the LSN field containing the LSN of the last WAL record for the page in order to obey the WAL-before-data rule, without which crash recovery will not work reliably. 
Now, if you sometimes have to use that field for a nonce that is a fake LSN, that means you no longer always have a place to store the real LSN. I can't convince myself off-hand that it's completely impossible to work around that problem, but it seems like any attempt to do so would be complicated and fragile at best. I don't think that's a direction that we want to go. Making crash recovery work reliably is a hard problem where we've had lots of bugs despite years of dedicated effort. TDE is also complex and has lots of pitfalls of its own. If we take two things which are individually complicated and hard to get right and intertwine them by making them share bit-space, I think it drives the complexity up to a level where we don't have much hope of getting things right. -- Robert Haas EDB: http://www.enterprisedb.com
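[Editor's note: the birthday-paradox arithmetic above can be checked directly. The draw count below is an illustrative assumption, not a measurement.]

```python
import math

def birthday_collision_probability(draws: float, bits: int) -> float:
    # Standard birthday bound: p ~= 1 - exp(-n^2 / 2^(bits + 1)).
    # expm1 keeps precision when the probability is tiny.
    return -math.expm1(-(draws ** 2) / 2.0 ** (bits + 1))

n = 2.0 ** 32  # ~4 billion independently chosen nonces, e.g. page writes

p_63 = birthday_collision_probability(n, 63)     # 63 bits of LSN space
p_128 = birthday_collision_probability(n, 128)   # 128-bit explicit nonce

# With 63 bits a collision is more likely than not (~0.63); with 128 bits
# it is negligible (~2^-65).
```

This is why uniqueness via random or semi-random assignment is plausible in a 128-bit space but hopeless in a 63-bit one; 63 bits would demand tight global coordination instead.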
Re: storing an explicit nonce
Hi, On 2021-05-27 12:00:03 -0400, Bruce Momjian wrote: > On Wed, May 26, 2021 at 05:02:01PM -0400, Bruce Momjian wrote: > > Rather than surprise anyone, I might as well just come out and say some > > things. First, I have always admitted this feature has limited > > usefulness. > > > > I think a non-LSN nonce adds a lot of code complexity, which adds a code > > and maintenance burden. It also prevents the creation of an encrypted > > replica from a non-encrypted primary using binary replication, which > > makes deployment harder. > > > > Take a feature of limited usefulness, add code complexity and deployment > > difficulty, and the feature becomes even less useful. > > > > For these reasons, if we decide to go in the direction of using a > > non-LSN nonce, I no longer plan to continue working on this feature. I > > would rather work on things that have a more positive impact. Maybe a > > non-LSN nonce is a better long-term plan, but there are too many > > unknowns and complexity for me to feel comfortable with it. > > [...] > I suspect that if we start adding a non-LSN nonce and malicious write > detection, we will end up with the same problem --- a complex patch for > a feature that has limited usefulness, and requires dump/restore or > logical replication to add it to a cluster. I think such a patch would > be rejected, and I would probably even vote against it myself. I think it's diametrically the opposite. Using the LSN as the nonce requires that all code modifying pages needs to be audited (which clearly hasn't been done yet), whereas an independent nonce can be maintained in a few central places. And that's not just a one-off issue, it's a forevermore issue. Greetings, Andres Freund
Re: storing an explicit nonce
On Thu, May 27, 2021 at 12:49 PM Stephen Frost wrote: > Right, if we simply can't solve the nonce-reuse concern then that would > be better. Given the issues that Andres raised about standbys and the treatment of the "hole," I see using the LSN for the nonce as a dead-end. I think it's pretty bad on performance grounds too, for reasons already discussed, but you could always hypothesize that people care so much about security that they will ignore any amount of trouble with performance. You can hardly hypothesize that those same people also won't mind security vulnerabilities that expose tuple data, though. I don't think the idea of storing the nonce at the end of the page is dead. There seem to be some difficulties there, but I think there are reasonable prospects of solving them. At the very least there's the brute-force approach of generating a ton of cryptographically strong random numbers, and there seems to be some possibility of doing better than that. However, I'm pretty excited by this idea of using XTS. Now granted I didn't have the foggiest idea what XTS was before today, but I hear you and Andres saying that we can use that approach without needing a nonce at all. That seems to make a lot of the problems we're talking about here just go away. > > Nor does it provide integrity - which does seem like a significant issue > > going forward. Which does require storing additional per-page data... > > Yeah, this is one of the reasons that I hadn't really been thrilled with > XTS- I've really been looking down the road at eventually having GCM and > having actual integrity validation included. > > That's not really a reason to rule it out though and Bruce's point about > having a way to get to an encrypted cluster from an unencrypted one is > certainly worth consideration. 
Naturally, we'd need to document > everything appropriately but there isn't anything saying that we > couldn't, say, have XTS in v15 without any adjustments to the page > layout, accepting that there's no data integrity validation and focusing > just on encryption, and then returning to the question about adding in > data integrity validation for a future version, perhaps using the > special area for a nonce+tag with GCM or maybe something else. Users > who wish to move to a cluster with encryption and data integrity > validation would have to get there through some other means than > replication, but that's going to always be the case because we have to > have space to store the tag, even if we can figure out some other > solution for the nonce. +1 from me to all of this except the idea of foreclosing present discussion on how data-integrity validation could be made to work. I think it would great to have more discussion of that problem now, in case it informs our decisions about anything else, especially because, based on your earlier remarks, it seems like there is some coupling between the two problems. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Move pg_attribute.attcompression to earlier in struct for reduced size?
On Thu, May 27, 2021 at 10:39 AM Tom Lane wrote: > What I'm hearing is a whole lot of hypothesizing and zero evidence of > actual field requirements. On the other side of the coin, we've already > wasted significant person-hours on fixing this feature's memory leakage, > and now people are proposing to expend more effort on solving^Wpapering > over its performance issues by adding yet more user-visible complication. > It's already adding too much user-visible complication IMO --- I know > because I was just copy-editing the documentation about that yesterday. > > I say it's time to stop the bleeding and rip it out. When and if > there are actual field requests to have a way to do this, we can > discuss what's the best way to respond to those requests. Hacking > VACUUM probably isn't the best answer, anyway. But right now, > we are past feature freeze, and I think we ought to jettison this > one rather than quickly kluge something. Thanks for sharing your thoughts. -1 from me. -- Robert Haas EDB: http://www.enterprisedb.com
Re: storing an explicit nonce
Hi, On 2021-05-27 13:26:11 -0400, Stephen Frost wrote: > * Andres Freund (and...@anarazel.de) wrote: > > On 2021-05-27 12:49:15 -0400, Stephen Frost wrote: > > > That's not really a reason to rule it out though and Bruce's point about > > > having a way to get to an encrypted cluster from an unencrypted one is > > > certainly worth consideration. Naturally, we'd need to document > > > everything appropriately but there isn't anything saying that we > > > couldn't, say, have XTS in v15 without any adjustments to the page > > > layout, accepting that there's no data integrity validation and focusing > > > just on encryption, and then returning to the question about adding in > > > data integrity validation for a future version, perhaps using the > > > special area for a nonce+tag with GCM or maybe something else. Users > > > who wish to move to a cluster with encryption and data integrity > > > validation would have to get there through some other means than > > > replication, but that's going to always be the case because we have to > > > have space to store the tag, even if we can figure out some other > > > solution for the nonce. > > > > But won't we then end up with a different set of requirements around > > nonce assignment durability when introducing GCM support? That's not > > actually entirely trivial to do correctly on a standby. I guess we can > > use AES-GCM-SIV and be ok with living with edge cases leading to nonce > > reuse, but ... > > Not sure if I'm entirely following the question Going for XTS in the short term might end up duplicating a lot of effort if, in the medium term, we then have to solve all the issues around maintaining nonces efficiently and correctly anyway, because we want integrity support. 
> but I would have thought the up-thread idea of generating a random > part of the nonce for each start up and then a global counter for the > rest, which would be written whenever the page is updated (meaning it > wouldn't have anything to do with the LSN and would be stored in the > special area as Robert contemplated) would work for both primaries and > replicas. Yea, it's not a bad approach. Particularly because it removes the need to ensure that "global nonce counter" increments are guaranteed to be durable. > For one, we'd probably want to get agreement on what we'd use to > construct the tweak, for starters. Hm, isn't that just a pg_strong_random() and storing it encrypted? Greetings, Andres Freund
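[Editor's note: the nonce scheme discussed above — a random component drawn once per server start plus a global counter written into the page's special area — can be sketched as follows. The class name and field widths are illustrative, not taken from any patch.]

```python
import os
import struct

class NonceGenerator:
    """Sketch of the scheme discussed above: a random prefix drawn once
    per server startup, plus a monotonically increasing counter.  The
    counter used for a page would be stored in that page's special area,
    so counter increments need not be made durable on their own: after a
    restart the prefix changes, so nonces cannot repeat even if the
    counter resets."""

    def __init__(self, prefix_bytes: int = 8):
        self.prefix = os.urandom(prefix_bytes)  # fresh each startup
        self.counter = 0

    def next_nonce(self) -> bytes:
        self.counter += 1
        # 8-byte random prefix + 8-byte big-endian counter = 16 bytes
        return self.prefix + struct.pack(">Q", self.counter)

gen = NonceGenerator()
n1, n2 = gen.next_nonce(), gen.next_nonce()
assert len(n1) == 16 and n1 != n2      # distinct nonces, same prefix
```

This is only a model of the durability argument: because the per-page counter value travels with the page and the prefix is regenerated at startup, nothing about the counter itself has to be fsync'd.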
Re: Race condition in recovery?
On Wed, May 26, 2021 at 8:49 PM Kyotaro Horiguchi wrote: > So in the mail [1] and [2] I tried to describe what's going on around > the two issues. Although I haven't had a response to [2], can I > think that we clarified the intention of ee994272ca? And may I think > that we decided that we don't add a test for the commit? Regarding the first question, I feel that the intention of ee994272ca is fairly clear at this point. Someone else might feel differently so I won't presume to speak for anyone but me. Regarding the second question, I am not opposed to adding a test for that commit, but I think it is a lot more important to fix the bug we have now than to add a test for a bug that was fixed a long time ago. > Then it seems to me that Robert refound how to reproduce Dilip's case > using basebackup instead of using two standbys. It is using a broken > archive_command with pg_basebackup -Xnone and I showed that the same > resulting state is available by pg_basebackup -Xstream/fetch clearing > pg_wal directory of the resulting backup including an explanation of > why. Yes, it makes sense that we could get to the same state either by not fetching the WAL in the first place, or alternatively by fetching it and then removing it. > *I* think that it is better to avoid having the archive_command since > it seems to me that just unlinking some files seems simpler than having > the broken archive_command. However, since Robert ignored it, I guess > that Robert thinks that the broken archive_command is better than > that. Well ... I don't see those things as quite related. As far as I can see, unlinking files from pg_wal is an alternative to using -Xnone. On the other hand, the broken archive_command is there to make sure the new primary doesn't archive its WAL segment too soon.
Regarding the first point, I think using -Xnone is better than using -Xfetch/stream and then removing the WAL, because (1) it doesn't seem efficient to fetch WAL only to turn around and remove it and (2) someone might question whether removing the WAL afterward is a supported procedure, whereas using an option built into the tool must surely be supported. Regarding the second point, I think using the broken archive command is superior because we can be sure of the behavior. If we just rely on not having crossed a segment boundary, then anything that causes more WAL to be generated than we are expecting could break the test. I don't think it's particularly likely in a case like this that autovacuum or any other thing would kick in and generate extra WAL, but the broken archive command ensures that even if it does happen, the test will still work as intended. That, to me, seems like a good enough reason to do it that way. > FWIW, regarding the name of the test script, putting aside what it > actually does, I proposed to place it as a part of > 004_timeline_switch.pl because this issue is related to timeline > switching. I think it is better to keep it separate. Long test scripts that test multiple things with completely separate tests are hard to read. Kyotaro-san, I hope I have not given any offense. I am doing my best, and certainly did not mean to be rude. -- Robert Haas EDB: http://www.enterprisedb.com
Re: storing an explicit nonce
Greetings, * Andres Freund (and...@anarazel.de) wrote: > On 2021-05-27 13:26:11 -0400, Stephen Frost wrote: > > * Andres Freund (and...@anarazel.de) wrote: > > > On 2021-05-27 12:49:15 -0400, Stephen Frost wrote: > > > > That's not really a reason to rule it out though and Bruce's point about > > > > having a way to get to an encrypted cluster from an unencrypted one is > > > > certainly worth consideration. Naturally, we'd need to document > > > > everything appropriately but there isn't anything saying that we > > > > couldn't, say, have XTS in v15 without any adjustments to the page > > > > layout, accepting that there's no data integrity validation and focusing > > > > just on encryption, and then returning to the question about adding in > > > > data integrity validation for a future version, perhaps using the > > > > special area for a nonce+tag with GCM or maybe something else. Users > > > > who wish to move to a cluster with encryption and data integrity > > > > validation would have to get there through some other means than > > > > replication, but that's going to always be the case because we have to > > > > have space to store the tag, even if we can figure out some other > > > > solution for the nonce. > > > > > > But won't we then end up with a different set of requirements around > > > nonce assignment durability when introducing GCM support? That's not > > > actually entirely trivial to do correctly on a standby. I guess we can > > > use AES-GCM-SSIV and be ok with living with edge cases leading to nonce > > > reuse, but ... > > > > Not sure if I'm entirely following the question > > It seems like it might end up with lots of duplicated effort to go for > XTS in the short term, if we medium term then have to solve all the > issues around how to maintain efficiently and correctly nonces anyway, > because we want integrity support. 
You and Robert both seem to be going in that direction, one which I tend to share, while Bruce is very hard set against it, from the perspective that he doesn't view integrity as important (I disagree quite strongly with that; even if we can't protect everything, I see it as certainly valuable to protect the primary data) and that this approach adds complexity (the amount of which doesn't seem to be agreed upon). I'm also not sure how much of the effort would really be duplicated. Were we to start with XTS, that's almost drop-in with what Bruce has (actually, it should simplify some parts since we no longer need to deal with making sure we always increase the LSN, etc), gives users more flexibility in terms of getting to an encrypted cluster, and solves certain use-cases. Very little of that seems like it would be ripped out if we were to (also) provide a GCM option. Now, if we were to *only* provide a GCM option then maybe we wouldn't need to think about the XTS case of having to come up with a tweak (though that seems like a rather small amount of code), but that would also mean we need to change the page format and we can't do any kind of binary/page-level transition to an encrypted cluster, like we could with XTS. Trying to break it down, the end-goal states look like:

GCM-only: no binary upgrade path due to having to store the tag
XTS-only: no data integrity option
GCM+XTS: binary upgrade path for XTS, data integrity with GCM

If we want both a binary upgrade path and a data integrity option, then it seems like the only end state which provides both is GCM+XTS, in which case I don't think there's a lot of actual duplication.
Perhaps there's an "XTS + some other data integrity approach" option where we could preserve the page format by stuffing information into another fork or maybe telling users to hash their data and store that hash as another column which would allow us to avoid implementing GCM, but I don't see a way to avoid having XTS if we are going to provide a binary upgrade path. Perhaps AES-GCM-SIV would be interesting to consider in general, but that still means we need to find space for the tag and that still precludes a binary upgrade path. > > but I would have thought the up-thread idea of generating a random > > part of the nonce for each start up and then a global counter for the > > rest, which would be written whenever the page is updated (meaning it > > wouldn't have anything to do with the LSN and would be stored in the > > special area as Robert contemplated) would work for both primaries and > > replicas. > > Yea, it's not a bad approach. Particularly because it removes the need > to ensure that "global nonce counter" increments are guaranteed to be > durable. Right. > > For one, we'd probably want to get agreement on what we'd use to > > construct the tweak, for starters. > > Hm, isn't that just a pg_strong_random() and storing it encrypted? Perhaps it is, but at least in some other cases it's generated based on sector and block (which maybe could be relfilenode and block for us?): https://medium.com/asecuritysite-whe
Re: Move pg_attribute.attcompression to earlier in struct for reduced size?
On 2021-May-27, Tom Lane wrote: > I say it's time to stop the bleeding and rip it out. When and if > there are actual field requests to have a way to do this, we can > discuss what's the best way to respond to those requests. Hacking > VACUUM probably isn't the best answer, anyway. But right now, > we are past feature freeze, and I think we ought to jettison this > one rather than quickly kluge something. Sorry, I'm unclear on exactly what you are proposing. Are you proposing to rip out the fact that VACUUM FULL promises to recompress everything, or are you proposing to rip out the whole attcompression feature? Absolute -1 on the latter from me. Pluggable compression has taken years to get to this point; it certainly won't do to give that up. Now about the former. If we do think that recompressing causes an unacceptable 10% slowdown for every single VACUUM FULL, then yeah we should discuss changing that behavior -- maybe remove promises of recompression and wait for pg15 to add "VACUUM (RECOMPRESS)" or similar. If it's a 10% slowdown of only the best times (variability unspecified) and only in corner cases (unlogged tables with no indexes that fit in shared buffers), then I don't think we should bother. -- Álvaro Herrera 39°49'30"S 73°17'W "You're _really_ hosed if the person doing the hiring doesn't understand relational systems: you end up with a whole raft of programmers, none of whom has had a Date with the clue stick." (Andrew Sullivan)
Re: storing an explicit nonce
On Thu, May 27, 2021 at 1:07 PM Andres Freund wrote: > But won't we then end up with a different set of requirements around > nonce assignment durability when introducing GCM support? That's not > actually entirely trivial to do correctly on a standby. I guess we can > use AES-GCM-SSIV and be ok with living with edge cases leading to nonce > reuse, but ... All these different encryption modes are hard for me to grok. That said, I want to mention a point which I think may be relevant here. As far as I know, in the case of a permanent table page, we never write X then X' then X again. If the change is WAL-logged, then the LSN advances, and it will never thereafter go backward. Otherwise, it's something that uses MarkBufferDirtyHint(). As far as I know, all of those changes are one-way. For example, we set hint bits without logging the change, but anything that clears hint bits is logged. We mark btree index items dead as a type of hint, but they never come back to life again; instead, they get cleaned out of the page entirely as a WAL-logged operation. So I don't know that an adversary seeing the same exact ciphertext multiple times is really likely to occur. Well, it could certainly occur for temporary or unlogged tables, since those have LSN = 0. And in cases where we currently copy pages around, like creating a new database, it could happen. I suspect those cases could be fixed, if we cared enough, and there are independent reasons to want to fix the create-new-database case. It would be fairly easy to put fake LSNs in temporary buffers, since they're in a separate pool of buffers in backend-private memory with a separate buffer manager. And it could probably even be done for unlogged tables, though not as easily. Or we could use the special-space technique to put some unpredictable garbage into each page and then change the garbage every time we write the page. 
I read the discussion so far to say that maybe these kinds of measures aren't even needed, and if so, great. But even without doing anything, I don't think it's going to happen very much. Another case where this sort of thing might happen is a standby doing whatever the master did. I suppose that could be avoided if the standby always has its own encryption keys, but that forces a key rotation when you create a standby, and it doesn't seem like a lot of fun to insist on that. But the information leak seems minor. If we get to a point where an adversary with full filesystem access on all our systems can't do better than assessing our replication lag, we'll be a lot better off then than we are now. -- Robert Haas EDB: http://www.enterprisedb.com
Re: storing an explicit nonce
Hi, On 2021-05-27 15:22:21 -0400, Stephen Frost wrote: > I'm also not sure how much of the effort would really be duplicated. > > Were we to start with XTS, that's almost drop-in with what Bruce has > (actually, it should simplify some parts since we no longer need to deal > with making sure we always increase the LSN, etc) gives users more > flexibility in terms of getting to an encrypted cluster and solves > certain use-cases. Very little of that seems like it would be ripped > out if we were to (also) provide a GCM option. > Now, if we were to *only* provide a GCM option then maybe we wouldn't > need to think about the XTS case of having to come up with a tweak > (though that seems like a rather small amount of code) but that would > also mean we need to change the page format and we can't do any kind of > binary/page-level transistion to an encrypted cluster, like we could > with XTS. > Trying to break it down, the end-goal states look like: > > GCM-only: no binary upgrade path due to having to store the tag > XTS-only: no data integrity option > GCM+XTS: binary upgrade path for XTS, data integrity with GCM Why would GCM + XTS make sense? Especially if we were to go with AES-GCM-SIV or something, drastically reducing the danger of nonce reuse? And I don't think there's an easy way to do both using openssl, without double encrypting, which we'd obviously not want for performance reasons. And I don't think we'd want to implement either ourselves - leaving other dangers aside, I don't think we want to do the optimization work necessary to get good performance. > If we want both a binary upgrade path, and a data integrity option, then > it seems like the only end state which provides both is GCM+XTS, in > which case I don't think there's a lot of actual duplication. I honestly feel that Bruce's point about trying to shoot for the moon, and thus not getting the basic feature done, applies much more to the binary upgrade path than anything else. 
I think we should just stop aiming for that for now. If we later want to add code that goes through the cluster to ensure that there's enough space on each page for integrity data, to provide a migration path, fine. But we shouldn't make the binary upgrade path for TDE a hard requirement. > > > For one, we'd probably want to get agreement on what we'd use to > > > construct the tweak, for starters. > > > > Hm, isn't that just a pg_strong_random() and storing it encrypted? > > Perhaps it is, but at least in some other cases it's generated based on > sector and block (which maybe could be relfilenode and block for us?): > > https://medium.com/asecuritysite-when-bob-met-alice/who-needs-a-tweak-meet-full-disk-encryption-437e720879ac My understanding is that you'd use tweak_secret + block_offset or someop(tweak_secret, relfilenode) + block_offset to generate the actual per-block (in the 8192 byte, not 128bit sense) tweak. Greetings, Andres Freund
Re: storing an explicit nonce
On Thu, May 27, 2021 at 3:22 PM Stephen Frost wrote: > Trying to break it down, the end-goal states look like: > > GCM-only: no binary upgrade path due to having to store the tag > XTS-only: no data integrity option > GCM+XTS: binary upgrade path for XTS, data integrity with GCM > > If we want both a binary upgrade path, and a data integrity option, then > it seems like the only end state which provides both is GCM+XTS, in > which case I don't think there's a lot of actual duplication. > > Perhaps there's an "XTS + some other data integrity approach" option > where we could preserve the page format by stuffing information into > another fork or maybe telling users to hash their data and store that > hash as another column which would allow us to avoid implementing GCM, > but I don't see a way to avoid having XTS if we are going to provide a > binary upgrade path. > > Perhaps AES-GCM-SIV would be interesting to consider in general, but > that still means we need to find space for the tag and that still > precludes a binary upgrade path. Anything that decouples features without otherwise losing ground is a win. If there are things A and B, such that A does encryption and B does integrity validation, and A and B can be turned on and off independently of each other, that is better than some otherwise-comparable C that provides both features. But I'm going to have to defer to you and Andres and whoever else on whether that's true for any encryption methods/modes in particular. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Move pg_attribute.attcompression to earlier in struct for reduced size?
On Thu, May 27, 2021 at 7:25 AM Robert Haas wrote: > TBH, I'm more concerned about the other direction. Surely someone who > upgrades from an existing release to v14 and sets their compression > method to lz4 is going to want a way of actually converting their data > to using lz4. Your argument would be more convincing (at least to me) if we really did expect users to want to pick and choose, based on natural variations in datasets that make switching to *either* potentially yield a real benefit. It is my understanding that lz4 is pretty much superior to pglz by every relevant measure, though, so I'm not sure that that argument can be made. At the same time, users tend to only care specifically about things that are real step changes -- which I don't think this qualifies as. Users will go out of their way to get one of those, but otherwise won't bother. Perhaps there is a practical argument in favor of VACUUM FULL reliably recompressing using lz4 on upgrade, where that's the user's stated preference. It's not self-evident that VACUUM FULL must or even should do that, at least to me. I'm not suggesting that there must not be such an argument. Just that I don't think that anybody has made such an argument. > To say that nobody cares about that is to deem the > feature useless. Maybe that's what Tom thinks, but it's not what I > think. I don't think that follows at all. -- Peter Geoghegan
Re: storing an explicit nonce
Hi, On 2021-05-27 15:48:09 -0400, Robert Haas wrote: > That said, I want to mention a point which I think may be relevant > here. As far as I know, in the case of a permanent table page, we > never write X then X' then X again. Well, there's crash recovery / restarts. And as previously explained they can end up with different page contents than before. > And in cases where we currently copy pages around, like creating a new > database, it could happen. As long as it's identical data that should be fine, except leaking that the data is identical. Which doesn't make me really concerned in the case of template databases. > I suspect those cases could be fixed, if we cared enough, and there > are independent reasons to want to fix the create-new-database > case. It would be fairly easy to put fake LSNs in temporary buffers, > since they're in a separate pool of buffers in backend-private memory > with a separate buffer manager. And it could probably even be done for > unlogged tables, though not as easily. [...] I read > the discussion so far to say that maybe these kinds of measures aren't > even needed, and if so, great. But even without doing anything, I > don't think it's going to happen very much. What precisely are you referring to with "aren't even needed"? I don't see how the fake LSN approach can work for the crash recovery issues? > Or we could use the special-space technique to put some unpredictable > garbage into each page and then change the garbage every time we write > the page Unfortunately with CTR mode that doesn't provide much protection, if it's part of the encrypted data (vs IV/nonce). A one bit change in the encrypted data only changes one bit in the unencrypted data, as the data is just XORd with the cipher stream. So random changes in one place don't prevent disclosure in other parts of the data if the nonce doesn't also change. And one can easily predict the effect of flipping certain bits. 
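[Editor's note: the XOR malleability of CTR mode described above can be demonstrated in a few lines. The keystream here is simulated with random bytes rather than derived from AES-CTR, but the bit-flipping property being discussed is identical.]

```python
import os

def ctr_like_encrypt(keystream: bytes, data: bytes) -> bytes:
    # CTR mode reduces to XORing the data with a keystream; that XOR
    # is all we model here (decryption is the same operation).
    return bytes(d ^ k for d, k in zip(data, keystream))

keystream = os.urandom(16)
plaintext = b"balance: 00100\n\n"
ciphertext = ctr_like_encrypt(keystream, plaintext)

# An attacker who never sees the key flips one ciphertext bit ...
tampered = bytearray(ciphertext)
tampered[9] ^= 0x01                     # low bit of byte 9

# ... and exactly that plaintext bit flips, predictably.
decrypted = ctr_like_encrypt(keystream, bytes(tampered))
assert decrypted[9] == plaintext[9] ^ 0x01
assert decrypted[:9] == plaintext[:9]   # all other bytes untouched
```

This is why random garbage that is merely part of the CTR-encrypted payload (rather than feeding into the IV/nonce) does not hide or protect the rest of the page.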
> Another case where this sort of thing might happen is a standby doing > whatever the master did. I suppose that could be avoided if the > standby always has its own encryption keys, but that forces a key > rotation when you create a standby, and it doesn't seem like a lot of > fun to insist on that. But the information leak seems minor. Which leaks seem minor? The "hole" issues leak all the prior contents of the hole, without needing any complicated analysis of the data, because one plain text is known (zeroes). Greetings, Andres Freund
Re: storing an explicit nonce
Greetings, * Andres Freund (and...@anarazel.de) wrote: > On 2021-05-27 15:22:21 -0400, Stephen Frost wrote: > > I'm also not sure how much of the effort would really be duplicated. > > > > Were we to start with XTS, that's almost drop-in with what Bruce has > > (actually, it should simplify some parts since we no longer need to deal > > with making sure we always increase the LSN, etc) gives users more > > flexibility in terms of getting to an encrypted cluster and solves > > certain use-cases. Very little of that seems like it would be ripped > > out if we were to (also) provide a GCM option. > > > Now, if we were to *only* provide a GCM option then maybe we wouldn't > > need to think about the XTS case of having to come up with a tweak > > (though that seems like a rather small amount of code) but that would > > also mean we need to change the page format and we can't do any kind of > > binary/page-level transistion to an encrypted cluster, like we could > > with XTS. > > > Trying to break it down, the end-goal states look like: > > > > GCM-only: no binary upgrade path due to having to store the tag > > XTS-only: no data integrity option > > GCM+XTS: binary upgrade path for XTS, data integrity with GCM > > Why would GCM + XTS make sense? Especially if we were to go with > AES-GCM-SIV or something, drastically reducing the danger of nonce > reuse? You can't get to a GCM-based solution without changing the page format and therefore you can't get there using streaming replication or a pg_upgrade that does an encrypt step along with the copy. > And I don't think there's an easy way to do both using openssl, without > double encrypting, which we'd obviously not want for performance > reasons. And I don't think we'd want to implement either ourselves - > leaving other dangers aside, I don't think we want to do the > optimization work necessary to get good performance. 
Eh, clearly a misunderstanding here - what I'm suggesting is that we'd have initdb options where someone could initdb and say they want XTS, OR they could initdb and say they want AES-GCM (or maybe AES-GCM-SIV). I'm not talking about doing both in the cluster at the same time. Or, with XTS, we could have an option to pg_basebackup + encrypt into XTS to build an encrypted replica from an unencrypted cluster. There isn't any way we could do that with GCM though since we wouldn't have any place to put the tag. > > If we want both a binary upgrade path, and a data integrity option, then > > it seems like the only end state which provides both is GCM+XTS, in > > which case I don't think there's a lot of actual duplication. > > I honestly feel that Bruce's point about trying to shoot for the moon, > and thus not getting the basic feature done, applies much more to the > binary upgrade path than anything else. I think we should just stop > aiming for that for now. If we later want to add code that goes through > the cluster to ensure that there's enough space on each page for > integrity data, to provide a migration path, fine. But we shouldn't make > the binary upgrade path for TDE a hard requirement. Ok, that's a pretty clear fundamental disagreement between you and Bruce. For my 2c, I tend to agree with you that the binary upgrade path isn't that critical. If we agree to forgo the binary upgrade requirement and are willing to accept Robert's approach to use the special area for the nonce+tag, or similar, then we could perhaps avoid the work of supporting XTS. > > > > For one, we'd probably want to get agreement on what we'd use to > > > > construct the tweak, for starters. > > > > > > Hm, isn't that just a pg_strong_random() and storing it encrypted? 
> > > > Perhaps it is, but at least in some other cases it's generated based on > > sector and block (which maybe could be relfilenode and block for us?): > > > > https://medium.com/asecuritysite-when-bob-met-alice/who-needs-a-tweak-meet-full-disk-encryption-437e720879ac > > My understanding is that you'd use > tweak_secret + block_offset > or > someop(tweak_secret, relfilenode) block_offset > > to generate the actual per-block (in the 8192 byte, not 128bit sense) tweak. The above article, at least, suggested encrypting the sector number using the second key and then multiplying that times 2^(block number), where those blocks were actually AES 128bit blocks. The article further claims that this is what's used in things like Bitlocker, TrueCrypt, VeraCrypt and OpenSSL. While the documentation isn't super clear, I'm taking that to mean that when you actually use EVP_aes_128_xts() in OpenSSL, and you provide it with a 256-bit key (twice the size of the AES key length function), and you give it a 'tweak', that what you would actually be passing in would be the "sector number" in the above method, or for us perhaps it would be relfilenode+block number, or maybe just block number but it seems like it'd be better to include the relfilenode to me. OpenSSL docs: https://www.open
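[Editor's note: the tweak schedule described above — encrypt the sector number under the second key, then multiply by 2^(block number) — can be sketched as follows. Only the multiply-by-alpha step is shown; the initial tweak, which a real implementation would produce by AES-encrypting the sector/block address under the second half of the XTS key, is stubbed out here.]

```python
def xts_mult_alpha(tweak: bytes) -> bytes:
    """Multiply a 16-byte tweak by alpha (i.e. by x) in GF(2^128),
    using the XTS polynomial x^128 + x^7 + x^2 + x + 1 and the
    little-endian convention of IEEE P1619.  Applied once per
    successive 16-byte AES block within a sector."""
    t = int.from_bytes(tweak, "little") << 1
    if t >> 128:                         # carry out of the top bit:
        t = (t & ((1 << 128) - 1)) ^ 0x87  # reduce by the polynomial
    return t.to_bytes(16, "little")

# Stub standing in for E_K2(sector_number).
initial_tweak = bytes([1] + [0] * 15)

# Tweaks for the first few 16-byte AES blocks of one sector/page:
tweaks = [initial_tweak]
for _ in range(3):
    tweaks.append(xts_mult_alpha(tweaks[-1]))
assert len(set(tweaks)) == 4   # each AES block gets a distinct tweak
```

So when EVP_aes_128_xts() is handed a single "tweak" value, the per-128-bit-block variation within the 8 kB page falls out of this doubling schedule; only the per-sector (per-page) value has to be supplied by the caller.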
Re: Move pg_attribute.attcompression to earlier in struct for reduced size?
Alvaro Herrera writes: > Sorry, I'm unclear on exactly what are you proposing. Are you proposing > to rip out the fact that VACUUM FULL promises to recompress everything, > or are you proposing to rip out the whole attcompression feature? Just the former. regards, tom lane
Re: storing an explicit nonce
On 2021-May-27, Andres Freund wrote: > On 2021-05-27 15:48:09 -0400, Robert Haas wrote: > > Another case where this sort of thing might happen is a standby doing > > whatever the master did. I suppose that could be avoided if the > > standby always has its own encryption keys, but that forces a key > > rotation when you create a standby, and it doesn't seem like a lot of > > fun to insist on that. But the information leak seems minor. > > Which leaks seem minor? The "hole" issues leak all the prior contents of > the hole, without needing any complicated analysis of the data, because > one plain text is known (zeroes). Maybe that problem could be solved by having PageRepairFragmentation, compactify_tuples et al always fill the hole with zeroes, in encrypted databases. -- Álvaro Herrera Valdivia, Chile
Re: Move pg_attribute.attcompression to earlier in struct for reduced size?
Alvaro Herrera writes: > Now about the former. If we do think that recompressing causes an > unacceptable 10% slowdown for every single VACUUM FULLs, then yeah we > should discuss changing that behavior -- maybe remove promises of > recompression and wait for pg15 to add "VACUUM (RECOMPRESS)" or > similar. > If it's a 10% slowdown of the only best times (variability unspecified) > and only in corner cases (unlogged tables with no indexes that fit in > shared buffers), then I don't think we should bother. BTW, perhaps I should clarify my goal here: it's to cut off expending further effort on this feature during v14. If we can decide that the existing performance situation is acceptable, I'm content with that decision. But if we're to start designing new user-visible behavior to satisfy performance objections, then I'd prefer to remove this VACUUM behavior altogether for now. regards, tom lane
Re: storing an explicit nonce
Hi, On 2021-05-27 16:09:13 -0400, Stephen Frost wrote: > * Andres Freund (and...@anarazel.de) wrote: > > On 2021-05-27 15:22:21 -0400, Stephen Frost wrote: > > > I'm also not sure how much of the effort would really be duplicated. > > > > > > Were we to start with XTS, that's almost drop-in with what Bruce has > > > (actually, it should simplify some parts since we no longer need to deal > > > with making sure we always increase the LSN, etc) gives users more > > > flexibility in terms of getting to an encrypted cluster and solves > > > certain use-cases. Very little of that seems like it would be ripped > > > out if we were to (also) provide a GCM option. > > > > > Now, if we were to *only* provide a GCM option then maybe we wouldn't > > > need to think about the XTS case of having to come up with a tweak > > > (though that seems like a rather small amount of code) but that would > > > also mean we need to change the page format and we can't do any kind of > > > binary/page-level transistion to an encrypted cluster, like we could > > > with XTS. > > > > > Trying to break it down, the end-goal states look like: > > > > > > GCM-only: no binary upgrade path due to having to store the tag > > > XTS-only: no data integrity option > > > GCM+XTS: binary upgrade path for XTS, data integrity with GCM > > > [...] > > And I don't think there's an easy way to do both using openssl, without > > double encrypting, which we'd obviously not want for performance > > reasons. And I don't think we'd want to implement either ourselves - > > leaving other dangers aside, I don't think we want to do the > > optimization work necessary to get good performance. > > E, clearly a misunderstanding here- what I'm suggesting is that we'd > have initdb options where someone could initdb and say they want XTS, OR > they could initdb and say they want AES-GCM (or maybe AES-GCM-SIV). I'm > not talking about doing both in the cluster at the same time.. Ah, that makes more sense ;). 
So the end goal states are the different paths we could take? > Still leaves us with the question of what exactly we should pass into > OpenSSL as the 'tweak', if it should be the block offset inside the > file only, or the block offset + relfilenode, or something else. I think it has to include the relfilenode as a minimum. It'd not be great if you could identify equivalent blocks in different tables. It might even be worth complicating createdb() a bit and including the dboid as well. Greetings, Andres Freund
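[Editor's note: a hypothetical per-page tweak input combining the identifiers mentioned above - database OID, relfilenode, and block number - might be packed like this. The 4+4+8 field layout is purely illustrative, chosen only to fill the 16 bytes an XTS tweak expects.]

```python
import struct

def page_tweak(dboid: int, relfilenode: int, blockno: int) -> bytes:
    """Hypothetical 16-byte XTS tweak input.  Including all three
    identifiers ensures that identical blocks in different relations
    (or different databases) never encrypt under the same tweak, so
    equivalent ciphertext blocks cannot be matched across tables."""
    return struct.pack(">IIQ", dboid, relfilenode, blockno)

# Same page contents, different relations => different tweaks:
assert page_tweak(5, 16384, 0) != page_tweak(5, 16385, 0)
```

Whether the database OID belongs in there (given what createdb() currently does with copied files) is exactly the open question in the message above.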
Re: storing an explicit nonce
Hi, On 2021-05-27 16:13:44 -0400, Alvaro Herrera wrote: > Maybe that problem could be solved by having PageRepairFragmentation, > compactify_tuples et al always fill the hole with zeroes, in encrypted > databases. If that were the only issue, maybe. But there's plenty other places were similar things happen. Look at all the stuff that needs to be masked out for wal consistency checking (checkXLogConsistency() + all the things it calls). And there's no way proposed to actually have a maintainable way of detecting omissions around this. Greetings, Andres Freund
Re: Move pg_attribute.attcompression to earlier in struct for reduced size?
Peter Geoghegan writes: > On Thu, May 27, 2021 at 7:25 AM Robert Haas wrote: >> To say that nobody cares about that is to deem the >> feature useless. Maybe that's what Tom thinks, but it's not what I >> think. > I don't think that follows at all. Yeah. My belief here is that users might bother to change default_toast_compression, or that we might do it for them in a few years, but the gains from doing so are going to be only incremental. That being the case, most DBAs will be content to allow the older compression method to age out of their databases through routine row updates. The idea that somebody is going to be excited enough about this to run a downtime-inducing VACUUM FULL doesn't really pass the smell test. That doesn't make LZ4 compression useless, by any means, but it does suggest that we shouldn't be adding overhead to VACUUM FULL for the purpose of easing instantaneous switchovers. I'll refrain from re-telling old war stories about JPEG/GIF/PNG, but I do have real-world experience with compression algorithm changes. IME you need an integer-multiples type of improvement to get peoples' attention, and LZ4 isn't going to offer that, except maybe in cherry-picked examples. regards, tom lane
Re: Support for NSS as a libpq TLS backend
> On 25 Mar 2021, at 00:56, Jacob Champion wrote: > Databases that are opened *after* the first one are given their own separate > slots. Any certificates that are part of those databases seemingly can't be > referenced directly by nickname. They have to be prefixed by their token name > -- a name which you don't have if you used NSS_InitContext() to create the > database. You have to use SECMOD_OpenUserDB() instead. This explains some > strange failures I was seeing in local testing, where the order of > InitContext determined whether our client certificate selection succeeded or > failed. Sorry for the latency in responding, but I'm now back from parental leave. AFAICT the token name for the database can be set with the dbTokenDescription member in the NSSInitParameters struct passed to NSS_InitContext() (documented in nss.h). Using this we can avoid the messier SECMOD machinery and use the token in the auth callback to refer to the database we loaded. I hacked this up in my local tree (rebased patchset coming soon) and it seems to work as intended. -- Daniel Gustafsson https://vmware.com/
Re: storing an explicit nonce
Greetings, * Andres Freund (and...@anarazel.de) wrote: > On 2021-05-27 16:09:13 -0400, Stephen Frost wrote: > > * Andres Freund (and...@anarazel.de) wrote: > > > On 2021-05-27 15:22:21 -0400, Stephen Frost wrote: > > > > I'm also not sure how much of the effort would really be duplicated. > > > > > > > > Were we to start with XTS, that's almost drop-in with what Bruce has > > > > (actually, it should simplify some parts since we no longer need to deal > > > > with making sure we always increase the LSN, etc) gives users more > > > > flexibility in terms of getting to an encrypted cluster and solves > > > > certain use-cases. Very little of that seems like it would be ripped > > > > out if we were to (also) provide a GCM option. > > > > > > > Now, if we were to *only* provide a GCM option then maybe we wouldn't > > > > need to think about the XTS case of having to come up with a tweak > > > > (though that seems like a rather small amount of code) but that would > > > > also mean we need to change the page format and we can't do any kind of > > > > binary/page-level transition to an encrypted cluster, like we could > > > > with XTS. > > > > > > > Trying to break it down, the end-goal states look like: > > > > > > > > GCM-only: no binary upgrade path due to having to store the tag > > > > XTS-only: no data integrity option > > > > GCM+XTS: binary upgrade path for XTS, data integrity with GCM > > > [...] > > > And I don't think there's an easy way to do both using openssl, without > > > double encrypting, which we'd obviously not want for performance > > > reasons. And I don't think we'd want to implement either ourselves - > > > leaving other dangers aside, I don't think we want to do the > > > optimization work necessary to get good performance. 
> > > > Eh, clearly a misunderstanding here- what I'm suggesting is that we'd > > have initdb options where someone could initdb and say they want XTS, OR > > they could initdb and say they want AES-GCM (or maybe AES-GCM-SIV). I'm > > not talking about doing both in the cluster at the same time.. > > Ah, that makes more sense ;). So the end goal states are the different > paths we could take? The end goals are different possible things we could provide support for, not in one cluster, but in one build of PG. That is, we could add support in v15 (or whatever) for: initdb --encryption-type=AES-XTS and then in v16 add support for: initdb --encryption-type=AES-GCM (or AES-GCM-SIV, whatever) while keeping support for AES-XTS. Users who just want encryption could go do a binary upgrade of some kind to a cluster which has AES-XTS encryption, but to get GCM they'd have to initialize a new cluster and migrate data to it using logical replication or pg_dump/restore. There have also been requests for other possible encryption options, so I don't think these would even be the only options eventually, though I do think we'd probably have them broken down into "just encryption" or "encryption + data integrity" with the same resulting limitations regarding the ability to do binary upgrades. > > Still leaves us with the question of what exactly we should pass into > OpenSSL as the 'tweak', if it should be the block offset inside the > file only, or the block offset + relfilenode, or something else. > > I think it has to include the relfilenode as a minimum. It'd not be > great if you could identify equivalent blocks in different tables. It > might even be worth complicating createdb() a bit and including the > dboid as well. At this point I'm wondering if it's just: dboid/relfilenode:block-offset and then we hash it to whatever size EVP_CIPHER_iv_length(AES-XTS-128) (or -256, whatever we're using based on what was passed to initdb) returns. 
Thanks, Stephen
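Stephen's closing idea, hashing a "dboid/relfilenode:block-offset" identifier down to the cipher's tweak length, can be sketched in a few lines. This is an illustrative model, not PostgreSQL code: the identifier format and the use of SHA-256 truncated to 16 bytes (the tweak size OpenSSL reports for AES-XTS) are assumptions made for the sketch.

```python
import hashlib

XTS_TWEAK_LEN = 16  # bytes; what EVP_CIPHER_iv_length() reports for AES-XTS


def xts_tweak(dboid: int, relfilenode: int, block: int) -> bytes:
    """Derive a fixed-size XTS tweak from a block's identity.

    Hashes a "dboid/relfilenode:block-offset" identifier and truncates the
    digest to the tweak size, so equivalent blocks in different relations
    (or different databases) never share a tweak.
    """
    ident = f"{dboid}/{relfilenode}:{block}".encode("ascii")
    return hashlib.sha256(ident).digest()[:XTS_TWEAK_LEN]
```

Because the relfilenode and dboid feed into the hash, the same block offset in two different relations yields different tweaks, which addresses Andres' concern about identifying equivalent blocks across tables.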
Re: storing an explicit nonce
On Thu, May 27, 2021 at 4:04 PM Andres Freund wrote: > On 2021-05-27 15:48:09 -0400, Robert Haas wrote: > > That said, I want to mention a point which I think may be relevant > > here. As far as I know, in the case of a permanent table page, we > > never write X then X' then X again. > > Well, there's crash recovery / restarts. And as previously explained > they can end up with different page contents than before. Right, I'm not trying to oversell this point ... if in system XYZ there's a serious security exposure from ever repeating a page write, we should not use system XYZ unless we do some work to make sure that every page write is different. But if we just think it would be nicer if page writes didn't repeat, that's probably *mostly* true today already. > I don't see how the fake LSN approach can work for the crash recovery > issues? I wasn't trying to say it could. You've convinced me on that point. > > Or we could use the special-space technique to put some unpredictable > > garbage into each page and then change the garbage every time we write > > the page > > Unfortunately with CTR mode that doesn't provide much protection, if > it's part of the encrypted data (vs IV/nonce). A one bit change in the > encrypted data only changes one bit in the unencrypted data, as the data > is just XORd with the cipher stream. So random changes in one place > doesn't prevent disclosure in other parts of the data if the nonce > doesn't also change. And one can easily predict the effect of flipping > certain bits. Yeah, I wasn't talking about CTR mode there. I was just saying if we wanted to avoid ever repeating a write. > > Another case where this sort of thing might happen is a standby doing > > whatever the master did. I suppose that could be avoided if the > > standby always has its own encryption keys, but that forces a key > > rotation when you create a standby, and it doesn't seem like a lot of > > fun to insist on that. But the information leak seems minor. 
> > Which leaks seem minor? The "hole" issues leak all the prior contents of > the hole, without needing any complicated analysis of the data, because > one plain text is known (zeroes). No. You're confusing what I was saying here, in the context of your comments about the limitations of AES-GCM-SIV, with the discussion with Bruce about nonce generation. -- Robert Haas EDB: http://www.enterprisedb.com
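The CTR malleability Andres describes above (a one-bit change in the ciphertext flips exactly the corresponding plaintext bit) is easy to demonstrate without a crypto library, because CTR encryption is just an XOR with a keystream. In this sketch, random bytes stand in for the AES-CTR keystream; that substitution is the only assumption, since the XOR structure is identical.

```python
import os


def ctr_xor(data: bytes, keystream: bytes) -> bytes:
    # CTR mode reduces to XORing the data with the cipher's keystream;
    # the same operation both encrypts and decrypts.
    return bytes(d ^ k for d, k in zip(data, keystream))


plaintext = b"pay $100 to alice"
keystream = os.urandom(len(plaintext))   # stand-in for an AES-CTR keystream
ciphertext = ctr_xor(plaintext, keystream)

# An attacker who can guess the plaintext at some offset can rewrite it
# without knowing the key: flip exactly the bits that differ.
tampered = bytearray(ciphertext)
tampered[5] ^= ord("1") ^ ord("9")       # turn the known '1' into a '9'

recovered = ctr_xor(bytes(tampered), keystream)
# recovered is now b"pay $900 to alice", with no other byte disturbed
```

This is also why a known plaintext region, such as a zero-filled hole, under a reused keystream immediately discloses that part of the keystream, and why unauthenticated CTR offers no protection against targeted bit-flips.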
Re: storing an explicit nonce
On Thu, May 27, 2021 at 04:09:13PM -0400, Stephen Frost wrote: > The above article, at least, suggested encrypting the sector number > using the second key and then multiplying that times 2^(block number), > where those blocks were actually AES 128bit blocks. The article further > claims that this is what's used in things like Bitlocker, TrueCrypt, > VeraCrypt and OpenSSL. > > While the documentation isn't super clear, I'm taking that to mean that > when you actually use EVP_aes_128_xts() in OpenSSL, and you provide it > with a 256-bit key (twice the size of the AES key length function), and > you give it a 'tweak', that what you would actually be passing in would > be the "sector number" in the above method, or for us perhaps it would > be relfilenode+block number, or maybe just block number but it seems > like it'd be better to include the relfilenode to me. If you go in that direction, you should make sure pg_upgrade preserves what you use (it does not preserve relfilenode, just pg_class.oid), and CREATE DATABASE still works with a simple file copy. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Re: storing an explicit nonce
Greetings, * Bruce Momjian (br...@momjian.us) wrote: > On Thu, May 27, 2021 at 04:09:13PM -0400, Stephen Frost wrote: > > The above article, at least, suggested encrypting the sector number > > using the second key and then multiplying that times 2^(block number), > > where those blocks were actually AES 128bit blocks. The article further > > claims that this is what's used in things like Bitlocker, TrueCrypt, > > VeraCrypt and OpenSSL. > > > > While the documentation isn't super clear, I'm taking that to mean that > > when you actually use EVP_aes_128_xts() in OpenSSL, and you provide it > > with a 256-bit key (twice the size of the AES key length function), and > > you give it a 'tweak', that what you would actually be passing in would > > be the "sector number" in the above method, or for us perhaps it would > > be relfilenode+block number, or maybe just block number but it seems > > like it'd be better to include the relfilenode to me. > > If you go in that direction, you should make sure pg_upgrade preserves > what you use (it does not preserve relfilenode, just pg_class.oid), and > CREATE DATABASE still works with a simple file copy. Ah, yes, good point, if we support in-place pg_upgrade of an encrypted cluster then the tweak has to be consistent between the old and new. I tend to agree with Andres that it'd be reasonable to make CREATE DATABASE do a bit more work for an encrypted cluster though, so I'm less concerned about that. Using pg_class.oid instead of relfilenode seems likely to complicate things like crash recovery though, wouldn't it? I wonder if there's something else we could use. Thanks, Stephen
Re: Move pg_attribute.attcompression to earlier in struct for reduced size?
On Tue, May 25, 2021 at 08:33:47PM -0500, Justin Pryzby wrote: > On Sun, May 23, 2021 at 12:25:10PM -0400, Tom Lane wrote: > > However, the more I looked at that code the less I liked it. > > I think the way that compression selection is handled for indexes, > > ie consult default_toast_compression on-the-fly, is *far* saner > > than what is currently implemented for tables. So I think we > > should redefine attcompression as "ID of a compression method > > to use, or \0 to select the prevailing default. Ignored if > > attstorage does not permit the use of compression". > > +1 > > It reminds me of reltablespace, which is stored as 0 to mean the database's > default tablespace. I was surprised to realize that I made this same suggestion last month... https://www.postgresql.org/message-id/20210320074420.gr11...@telsasoft.com |..unless we changed attcompression='\0' to mean (for varlena) "the default |compression". Rather than "resolving" to the default compression at the time |the table is created, columns without an explicit compression set would "defer" |to the GUC (of course, that only affects newly-inserted data). The original reason for that suggestion Michael handled differently in 63db0ac3f9e6bae313da67f640c95c0045b7f0ee -- Justin
Re: storing an explicit nonce
Hi, On 2021-05-27 16:55:29 -0400, Robert Haas wrote: > No. You're confusing what I was saying here, in the contents of your > comments about the limitations of AES-GCM-SIV, with the discussion > with Bruce about nonce generation. Ah. I think the focus on LSNs confused me a bit. FWIW: Nist guidance on IVs for AES GCM (surprisingly readable): https://nvlpubs.nist.gov/nistpubs/Legacy/SP/nistspecialpublication800-38d.pdf AES-GCM-SIV (harder to read): https://eprint.iacr.org/2017/168.pdf Greetings, Andres Freund
Re: Move pg_attribute.attcompression to earlier in struct for reduced size?
On Thu, May 27, 2021 at 1:29 PM Tom Lane wrote: > Yeah. My belief here is that users might bother to change > default_toast_compression, or that we might do it for them in a few > years, but the gains from doing so are going to be only incremental. > That being the case, most DBAs will be content to allow the older > compression method to age out of their databases through routine row > updates. The idea that somebody is going to be excited enough about > this to run a downtime-inducing VACUUM FULL doesn't really pass the > smell test. That was my original understanding of your position, FWIW. I agree with all of this. > That doesn't make LZ4 compression useless, by any means, but it does > suggest that we shouldn't be adding overhead to VACUUM FULL for the > purpose of easing instantaneous switchovers. Right. More generally, there often seems to be value in under-specifying what a compression option does. Or in treating them as advisory. You mentioned the history of SET STORAGE, which seems very relevant. I am reminded of the example of B-Tree deduplication with unique indexes, where we selectively apply the optimization based on page-level details. Deduplication isn't usually useful in unique indexes (for the obvious reason), though occasionally it is extremely useful. I think that there might be a variety of things that work a little like that. It can help with avoiding unnecessary dump and reload hazards, too. I am interested in hearing the *principle* behind Robert's position. This whole area seems like something that might have at least a couple of different schools of thought. If it is then I'd sincerely like to hear the other side of the argument. -- Peter Geoghegan
Re: Replacing pg_depend PIN entries with a fixed range check
I wrote: > Robert Haas writes: >> The first hunk of the patch seems to back away from the idea that the >> cutoff is 13000, but the second half of the patch says 13000 still >> matters. Not sure I understand what's going on there exactly. > Not sure exactly what you're looking at, but IIRC there is a place > where the patch is cleaning up after ab596105b's failure to adjust > bki.sgml to match its change of FirstBootstrapObjectId from 12000 > to 13000. I hadn't bothered to fix that separately, but I guess > we should do so, else v14 is going to ship with incorrect docs. I take that back: I had committed that doc fix, in 1f9b0e693, so I'm still unsure what was confusing you. (But a4390abec just reverted it, anyway.) Attached is a rebase over a4390abec. The decision in that commit to not expect global uniqueness of OIDs above 10K frees us to use a much simpler solution than before: we can just go ahead and start the backend's OID counter at 1, and not worry about conflicts, because the OID generation logic can deal with any conflicts just fine as long as you're okay with only having per-catalog uniqueness. So this gets rid of the set_next_oid mechanism that I'd invented in v2, and yet there's still no notable risk of running out of OIDs in the 10K-12K range. While testing this, I discovered something that I either never knew or had forgotten: the bootstrap backend is itself assigning some OIDs, specifically OIDs for the composite types associated with most of the system catalogs (plus their array types). I find this scary, because it is happening before we've built the catalog indexes, so it's impossible to ensure uniqueness. (Of course, when we do build the indexes, we'd notice any conflicts; but that's not a solution.) I think it accidentally works because we don't ask genbki.pl to assign any pg_type OIDs, but that seems fragile. Seems like maybe we should fix genbki.pl to assign those OIDs, and then change GetNewOidWithIndex to error out in bootstrap mode. 
However that's a pre-existing issue, so I don't feel that this patch needs to be the one to fix it.

regards, tom lane

diff --git a/doc/src/sgml/bki.sgml b/doc/src/sgml/bki.sgml
index db1b3d5e9a..f32208b550 100644
--- a/doc/src/sgml/bki.sgml
+++ b/doc/src/sgml/bki.sgml
@@ -418,19 +418,33 @@
 If genbki.pl needs to assign an OID to a catalog entry
 that does not have a manually-assigned OID, it will use a value in
-the range 1—11999. The server's OID counter is set to 12000
-at the start of a bootstrap run. Thus objects created by regular SQL
-commands during the later phases of bootstrap, such as objects created
-while running the information_schema.sql script,
-receive OIDs of 12000 or above.
+the range 1—11999. The server's OID counter is set to 1
+at the start of a bootstrap run, so that any objects created on-the-fly
+during bootstrap processing also receive OIDs in this range. (The
+usual OID assignment mechanism takes care of preventing any conflicts.)
+
+
+
+Objects with OIDs below FirstUnpinnedObjectId (12000)
+are considered pinned, preventing them from being
+deleted. (There are a small number of exceptions, which are
+hard-wired into IsPinnedObject().)
+initdb forces the OID counter up
+to FirstUnpinnedObjectId as soon as it's ready to
+create unpinned objects. Thus objects created during the later phases
+of initdb, such as objects created while
+running the information_schema.sql script, will
+not be pinned, while all objects known
+to genbki.pl will be.

 OIDs assigned during normal database operation are constrained to be
 16384 or higher. This ensures that the range 1—16383 is free for
 OIDs assigned automatically by genbki.pl or
-during bootstrap. These automatically-assigned OIDs are not considered
-stable, and may change from one installation to another.
+during initdb. These
+automatically-assigned OIDs are not considered stable, and may change
+from one installation to another.
diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 16493209c6..d4e1a13746 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -3264,8 +3264,7 @@ SCRAM-SHA-256$:&l
 (references pg_class.oid)
- The OID of the system catalog the dependent object is in,
- or zero for a DEPENDENCY_PIN entry
+ The OID of the system catalog the dependent object is in
@@ -3275,8 +3274,7 @@ SCRAM-SHA-256$ :&l
 (references any OID column)
- The OID of the specific dependent object,
- or zero for a DEPENDENCY_PIN entry
+ The OID of the specific dependent object
@@ -3467,19 +3465,6 @@ SCRAM-SHA-256$ :&l
-
-
- DEPENDENCY_PIN (p)
-
-
Re: [HACKERS] EvalPlanQual behaves oddly for FDW queries involving system columns
On 2019-Mar-15, Etsuro Fujita wrote: > (2019/03/04 12:10), Etsuro Fujita wrote: > > (2019/03/02 3:57), Andres Freund wrote: > > > FWIW, I pushed the EPQ patch, doing this conversion blindly. It'd be > > > awesome if you'd check that it actually works... > > > > I'll start the work later this week. I think I can post an (initial) > > report on that next week, maybe. > > Here is an updated version of the patch [1]. What happened to this patch? Seems it was forgotten ... -- Álvaro Herrera 39°49'30"S 73°17'W
Re: Teaching users how they can get the most out of HOT in Postgres 14
On Sun, May 23, 2021 at 11:34 PM Masahiko Sawada wrote: > I think the possible side effect of this hard-coded > BYPASS_THRESHOLD_PAGES would be that by default, bulkdelete is not > called for a long term and the index becomes bloat. What do you think of the approach taken in the attached POC patch? The patch makes it possible to disable the optimization by generalizing the INDEX_CLEANUP reloption to be an enum that looks like a ternary boolean (not just a plain boolean). INDEX_CLEANUP now accepts the values 'auto', 'on', and 'off' (plus a variety of alternative spellings, the usual ones for booleans in Postgres). Now 'auto' is the default, and 'on' forces the previous behavior inside vacuumlazy.c. It does not disable the failsafe, though -- INDEX_CLEANUP remains a fairly mechanical thing. This approach seems good to me because INDEX_CLEANUP remains consistent with the original purpose and design of INDEX_CLEANUP -- that was always an option that forced VACUUM to do something special with indexes. I don't see much downside to this approach, either. As things stand, INDEX_CLEANUP is mostly superseded by the failsafe, so we don't really need to talk about wraparound emergencies in the docs for INDEX_CLEANUP anymore. This seems much more elegant than either repurposing/reviving cleanup_index_scale_factor (which makes no sense to me at all) or inventing a new reloption (which would itself be in tension with INDEX_CLEANUP). There are some practical issues that make this patch surprisingly complicated for such a simple problem. For example, I hope that I haven't missed any subtlety in generalizing a boolean reloption like this. We've done similar things with GUCs in the past, but this may be a little different. Another concern with this approach is what it means for the VACUUM command itself. I haven't added an 'auto' spelling that is accepted by the VACUUM command in this POC version. But do I need to at all? 
Can that just be implied by not having any INDEX_CLEANUP option? And does StdRdOptions.vacuum_truncate now need to become a VacOptTernaryValue field too, for consistency with the new definition of StdRdOptions.vacuum_index_cleanup? -- Peter Geoghegan 0001-Generalize-VACUUM-s-INDEX_CLEANUP-option.patch
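The "ternary boolean" shape of the generalized reloption described above can be modeled roughly as follows. This is a Python sketch of the semantics only; the real implementation is C inside PostgreSQL's reloptions machinery, and everything here except the 'auto'/'on'/'off' values and the usual boolean spellings mentioned in the mail is invented for illustration.

```python
from enum import Enum


class VacOptTernaryValue(Enum):
    """Ternary setting: 'auto' defers the decision to VACUUM itself."""
    DEFAULT = "auto"
    DISABLED = "off"
    ENABLED = "on"


# The usual boolean spellings accepted for reloptions, plus 'auto'.
_SPELLINGS = {
    "auto": VacOptTernaryValue.DEFAULT,
    "on": VacOptTernaryValue.ENABLED, "true": VacOptTernaryValue.ENABLED,
    "yes": VacOptTernaryValue.ENABLED, "1": VacOptTernaryValue.ENABLED,
    "off": VacOptTernaryValue.DISABLED, "false": VacOptTernaryValue.DISABLED,
    "no": VacOptTernaryValue.DISABLED, "0": VacOptTernaryValue.DISABLED,
}


def parse_index_cleanup(value: str) -> VacOptTernaryValue:
    """Parse an INDEX_CLEANUP reloption value, case-insensitively."""
    try:
        return _SPELLINGS[value.strip().lower()]
    except KeyError:
        raise ValueError(f"invalid value for INDEX_CLEANUP: {value!r}") from None
```

The key property is that an unset or 'auto' value is distinguishable from an explicit 'on', so 'on' can force index vacuuming while 'auto' lets vacuumlazy.c apply the bypass heuristic.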
RE: Parallel Inserts in CREATE TABLE AS
From: Bharath Rupireddy > I'm still not sure why the execution time with 0 workers (or serial execution > or > no parallelism involved) on my testing system is 112 sec compared to 58 sec on > Hou-San's system for the same use case. Maybe the testing system I'm using > is not of the latest configuration compared to others. What's the setting of wal_level on your two systems? I thought it could be that you set it to something higher than minimal, while Hou-san set it to minimal. (I forgot the results of 2 and 4 workers, though.) Regards Takayuki Tsunakawa