RE: Parallel Inserts in CREATE TABLE AS
From: Bharath Rupireddy
Sent: Thursday, May 27, 2021 2:59 PM
> On Thu, May 27, 2021 at 12:19 PM houzj.f...@fujitsu.com wrote:
> > BTW, I checked my test results, I was testing INSERT INTO unlogged table.
>
> What do you mean by "testing INSERT INTO"? Is it that you are testing the
> timings for parallel inserts in the INSERT INTO ... SELECT command? If so,
> why should we test parallel inserts in the INSERT INTO ... SELECT command
> here?

Oops, sorry, it's a typo; I actually meant CREATE TABLE AS SELECT.

Best regards,
houzj
RE: Parallel Inserts in CREATE TABLE AS
From: Dilip Kumar
> Basically you are creating a new table and loading data into it, and that
> means you will be less likely to access those data soon, so for such a
> thing spoiling the buffer cache may not be a good idea.

Some people, including me, would say that the table will be accessed soon,
and that's why the data is loaded quickly during minimal maintenance hours.

> I was just suggesting it only as an experiment for identifying the root
> cause.

I thought this is a good chance to possibly change things for the better
(^^). I guess the user would simply think like this: "I just want to finish
CTAS as quickly as possible, so I configured it to take advantage of
parallelism. I want CTAS to make the most use of our resources. Why does
Postgres try to limit resource usage (by using the ring buffer) against my
will?"

Regards
Takayuki Tsunakawa
Re: Race condition in recovery?
On Thu, May 27, 2021 at 12:09 PM Kyotaro Horiguchi wrote:
>
> At Thu, 27 May 2021 11:44:47 +0530, Dilip Kumar wrote in
> > Maybe we can somehow achieve that without a broken archive command,
> > but I am not sure how it is enough to just delete WAL from pg_wal? I
> > mean my original case was that
> > 1. Got the new history file from the archive but did not get the WAL
> > file yet which contains the checkpoint after the TL switch
> > 2. So standby2 tried to stream from the new primary using the old TL
> > and set the wrong TL in expectedTLEs
> >
> > But if you are not doing anything to stop archiving WAL files or to
> > guarantee that the WAL has come to the archive and you deleted those,
> > then I am not sure how we are reproducing the original problem.
>
> Thanks for the reply!
>
> We're writing at the very beginning of the switching segment at
> promotion time. So it is guaranteed that the first segment of the
> newer timeline won't be archived until the remaining almost 16MB of the
> segment is consumed or someone explicitly causes a segment switch
> (including archive_timeout).

I agree.

> > BTW, I have also tested your script and I found the below log, which
> > shows that standby2 is successfully able to select timeline 2, so it
> > is not reproducing the issue. Am I missing something?
>
> standby_2? My last one, 026_timeline_issue_2.pl, doesn't use that name
> and uses "standby_1" and "cascade". In the earlier ones, standby_4 and
> 5 (or 3 and 4 in the later versions) are used in the additional tests.
>
> So I think it should be something different?

Yeah, I had tested your earlier patch, which had a different test case.
With "026_timeline_issue_2.pl", I am able to reproduce the issue.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: Parallel Inserts in CREATE TABLE AS
On Thu, May 27, 2021 at 12:46 PM tsunakawa.ta...@fujitsu.com wrote:
> I thought this is a good chance to possibly change things for the better
> (^^). I guess the user would simply think like this: "I just want to
> finish CTAS as quickly as possible, so I configured to take advantage of
> parallelism. I want CTAS to make the most use of our resources. Why does
> Postgres try to limit resource usage (by using the ring buffer) against
> my will?"

If the idea is to give the user control of whether or not to use the
separate RING BUFFER for bulk inserts/writes, then how about giving it as
a rel option? Currently BAS_BULKWRITE (GetBulkInsertState) is being used
by CTAS, Refresh Mat View, Table Rewrites (ATRewriteTable) and COPY.
Furthermore, we could make the rel option an integer and allow users to
provide the size of the ring buffer they want to choose for a particular
bulk insert operation (of course with a max limit which is not exceeding
the shared buffers or some reasonable amount not exceeding the RAM of the
system).

I think we can discuss this in a separate thread and see what other
hackers think.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
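A rough sketch of what such a rel option might look like; the option name
`bulkwrite_ring_size` is invented here purely for illustration, and no such
reloption exists in PostgreSQL:

```sql
-- Hypothetical syntax sketch: a per-table knob for the bulk-write ring
-- buffer.  The name "bulkwrite_ring_size" is made up for this example.
CREATE TABLE big_target (id bigint, payload text)
    WITH (bulkwrite_ring_size = '256MB');

-- 0 could mean "no ring buffer: use shared buffers freely"
ALTER TABLE big_target SET (bulkwrite_ring_size = 0);
```

The open design question would be the validation rule: the patch discussion
suggests capping the value at something well below shared_buffers.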
RE: Parallel Inserts in CREATE TABLE AS
From: houzj.f...@fujitsu.com
> Although, the 4 workers case still has performance degradation compared
> to the serial case.
>
> SERIAL:                           58759.213 ms
> PARALLEL 2 WORKER [NOT SKIP FSM]: 68390.221 ms  [SKIP FSM]: 58633.924 ms
> PARALLEL 4 WORKER [NOT SKIP FSM]: 67448.142 ms  [SKIP FSM]: 66960.305 ms

Can you see any difference in table sizes?

Regards
Takayuki Tsunakawa
Re: Race condition in recovery?
At Thu, 27 May 2021 12:47:30 +0530, Dilip Kumar wrote in
> On Thu, May 27, 2021 at 12:09 PM Kyotaro Horiguchi wrote:
> >
> > We're writing at the very beginning of the switching segment at
> > promotion time. So it is guaranteed that the first segment of the
> > newer timeline won't be archived until the remaining almost 16MB of
> > the segment is consumed or someone explicitly causes a segment switch
> > (including archive_timeout).
>
> I agree
>
> > > BTW, I have also tested your script and I found the below log, which
> > > shows that standby2 is successfully able to select timeline 2, so it
> > > is not reproducing the issue. Am I missing something?
> >
> > standby_2? My last one, 026_timeline_issue_2.pl, doesn't use that name
> > and uses "standby_1" and "cascade". In the earlier ones, standby_4 and
> > 5 (or 3 and 4 in the later versions) are used in the additional tests.
> >
> > So I think it should be something different?
>
> Yeah, I tested your patch where you had a different test case; with
> "026_timeline_issue_2.pl", I am able to reproduce the issue.

That said, I don't object if we decide to choose the crafted archive
command, as long as we have considered the trade-offs between the two
ways.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
Re: Parallel Inserts in CREATE TABLE AS
On Thu, May 27, 2021 at 1:03 PM tsunakawa.ta...@fujitsu.com wrote:
> > From: houzj.f...@fujitsu.com
> > Although, the 4 workers case still has performance degradation compared
> > to the serial case.
> >
> > SERIAL:                           58759.213 ms
> > PARALLEL 2 WORKER [NOT SKIP FSM]: 68390.221 ms  [SKIP FSM]: 58633.924 ms
> > PARALLEL 4 WORKER [NOT SKIP FSM]: 67448.142 ms  [SKIP FSM]: 66960.305 ms
>
> Can you see any difference in table sizes?

Also, the number of pages the table occupies in each case, along with the
table size, would give more insights.

I do as follows to get the number of pages a relation occupies:

CREATE EXTENSION pgstattuple;
SELECT pg_relpages('test');

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
RE: Parallel Inserts in CREATE TABLE AS
From: Bharath Rupireddy
> I think we can discuss this in a separate thread and see what other
> hackers think.

OK, as long as we don't get stuck in the current direction. (Our goal is
not merely to avoid degrading performance, but to outperform serial
execution, isn't it?)

> If the idea is to give the user control of whether or not to use the
> separate RING BUFFER for bulk inserts/writes, then how about giving it
> as a rel option? Currently BAS_BULKWRITE (GetBulkInsertState) is
> being used by CTAS, Refresh Mat View, Table Rewrites (ATRewriteTable)
> and COPY. Furthermore, we could make the rel option an integer and
> allow users to provide the size of the ring buffer they want to choose
> for a particular bulk insert operation (of course with a max limit
> which is not exceeding the shared buffers or some reasonable amount
> not exceeding the RAM of the system).

I think it's not a table property but an execution property. So, it'd be
appropriate to control it with the SET command, just like the DBA sets
work_mem and maintenance_work_mem for specific maintenance operations.
I'll stop on this here...

Regards
Takayuki Tsunakawa
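Under the "execution property" framing, usage would resemble the existing
per-session maintenance GUCs; the parameter name `bulk_write_ring_size`
below is invented for illustration and does not exist in PostgreSQL:

```sql
-- Hypothetical sketch: a session-level GUC controlling the bulk-write
-- ring buffer, by analogy with work_mem / maintenance_work_mem.
SET maintenance_work_mem = '1GB';    -- existing precedent for session tuning
SET bulk_write_ring_size = '256MB';  -- invented parameter, not real
CREATE TABLE new_copy AS SELECT * FROM big_source;
RESET bulk_write_ring_size;
```

The contrast with the reloption idea is scope: a GUC applies to whichever
bulk operation the session runs next, rather than being tied to one table.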
Re: Skipping logical replication transactions on subscriber side
On Thu, May 27, 2021 at 2:48 PM Amit Kapila wrote: > > On Thu, May 27, 2021 at 9:56 AM Masahiko Sawada wrote: > > > > On Wed, May 26, 2021 at 3:43 PM Amit Kapila wrote: > > > > > > On Tue, May 25, 2021 at 12:26 PM Masahiko Sawada > > > wrote: > > > > > > > > On Mon, May 24, 2021 at 7:51 PM Amit Kapila > > > > wrote: > > > > > > > > > > On Mon, May 24, 2021 at 1:32 PM Masahiko Sawada > > > > > wrote: > > > > > > > > > > I think you need to consider few more things here: > > > > > (a) Say the error occurs after applying some part of changes, then > > > > > just skipping the remaining part won't be sufficient, we probably need > > > > > to someway rollback the applied changes (by rolling back the > > > > > transaction or in some other way). > > > > > > > > After more thought, it might be better to that setting and resetting > > > > the XID to skip requires disabling the subscription. > > > > > > > > > > It might be better if it doesn't require disabling the subscription > > > because it would be more steps for the user to disable/enable it. It > > > is not clear to me what exactly you want to gain by disabling the > > > subscription in this case. > > > > The situation I’m considered is where the user specifies the XID while > > the worker is applying the changes of the transaction with that XID. > > In this case, I think we need to somehow rollback the changes applied > > so far. Perhaps we can either rollback the transaction and ignore the > > remaining changes or restart and ignore the entire transaction from > > the beginning. > > > > If we follow your suggestion of only allowing XIDs that have been > known to have conflicts then probably we don't need to worry about > rollbacks. > > > > > > > > > > > > For (2), what I'm thinking is to add a new action to ALTER > > > > > > SUBSCRIPTION command like ALTER SUBSCRIPTION test_sub SET SKIP > > > > > > TRANSACTION 590. Also, we can have actions to reset it; ALTER > > > > > > SUBSCRIPTION test_sub RESET SKIP TRANSACTION. 
Those commands add the > > > > > > XID to a new column of pg_subscription or a new catalog, having the > > > > > > worker reread its subscription information. Once the worker skipped > > > > > > the specified transaction, it resets the transaction to skip on the > > > > > > catalog. > > > > > > > > > > > > > > > > What if we fail while updating the reset information in the catalog? > > > > > Will it be the responsibility of the user to reset such a transaction > > > > > or we will retry it after restart of worker? Now, say, we give such a > > > > > responsibility to the user and the user forgets to reset it then there > > > > > is a possibility that after wraparound we will again skip the > > > > > transaction which is not intended. And, if we want to retry it after > > > > > restart of worker, how will the worker remember the previous failure? > > > > > > > > As described above, setting and resetting XID to skip is implemented > > > > as a normal system catalog change, so it's crash-safe and persisted. I > > > > think that the worker can either removes the XID or mark it as done > > > > once it skipped the specified transaction so that it won't skip the > > > > same XID again after wraparound. > > > > > > > > > > It all depends on when exactly you want to update the catalog > > > information. Say after skipping commit of the XID, we do update the > > > corresponding LSN to be communicated as already processed to the > > > subscriber and then get the error while updating the catalog > > > information then next time we might not know whether to update the > > > catalog for skipped XID. > > > > > > > Also, it might be better if we reset > > > > the XID also when a subscription field such as subconninfo is changed > > > > because it could imply the worker will connect to another publisher > > > > having a different XID space. 
> > > > We also need to handle the cases where the user specifies an old
> > > > XID, or an XID whose transaction is already prepared on the
> > > > subscriber. I think the worker can reset the XID with a warning
> > > > when it finds out that the XID seems no longer valid or it cannot
> > > > skip the specified XID. For example, in the former case, it can do
> > > > that when the first received transaction's XID is newer than the
> > > > specified XID.
> > >
> > > But how can we guarantee that an older XID can't be received later?
> > > Is there a guarantee that we receive the transactions on the
> > > subscriber in XID order?
> >
> > Considering the above two comments, it might be better to provide a
> > way to skip the transaction that is already known to be conflicted
> > rather than allowing users to specify an arbitrary XID.
>
> Okay, that makes sense, but I am still not sure how you will identify
> whether we need to reset the XID in case of a failure doing that in the
> previous attempt.

It's just an idea, but we can record the failed transaction with its XID
as well as its commit LSN. The sequence I'm thinking is, 1. the
Fix RADIUS error reporting in hba file parsing
The RADIUS-related checks in parse_hba_line() did not respect elevel and did not fill in *err_msg. Also, verify_option_list_length() pasted together error messages in an untranslatable way. To fix the latter, remove the function and do the error checking inline. It's a bit more verbose but only minimally longer, and it makes fixing the first two issues straightforward. From 063a8ae81d5ac33e8700bb0ea076d1499d36b655 Mon Sep 17 00:00:00 2001 From: Peter Eisentraut Date: Thu, 27 May 2021 10:26:21 +0200 Subject: [PATCH] Fix RADIUS error reporting in hba file parsing The RADIUS-related checks in parse_hba_line() did not respect elevel and did not fill in *err_msg. Also, verify_option_list_length() pasted together error messages in an untranslatable way. To fix the latter, remove the function and do the error checking inline. It's a bit more verbose but only minimally longer, and it makes fixing the first two issues straightforward. --- src/backend/libpq/hba.c | 90 ++--- 1 file changed, 48 insertions(+), 42 deletions(-) diff --git a/src/backend/libpq/hba.c b/src/backend/libpq/hba.c index 60767f2957..3be8778d21 100644 --- a/src/backend/libpq/hba.c +++ b/src/backend/libpq/hba.c @@ -144,8 +144,6 @@ static List *tokenize_inc_file(List *tokens, const char *outer_filename, const char *inc_filename, int elevel, char **err_msg); static bool parse_hba_auth_opt(char *name, char *val, HbaLine *hbaline, int elevel, char **err_msg); -static bool verify_option_list_length(List *options, const char *optionname, - List *comparelist, const char *comparename, int line_num); static ArrayType *gethba_options(HbaLine *hba); static void fill_hba_line(Tuplestorestate *tuple_store, TupleDesc tupdesc, int lineno, HbaLine *hba, const char *err_msg); @@ -1607,21 +1605,23 @@ parse_hba_line(TokenizedLine *tok_line, int elevel) if (list_length(parsedline->radiusservers) < 1) { - ereport(LOG, + ereport(elevel, (errcode(ERRCODE_CONFIG_FILE_ERROR), errmsg("list of RADIUS servers cannot be empty"), 
errcontext("line %d of configuration file \"%s\"", line_num, HbaFileName))); + *err_msg = "list of RADIUS servers cannot be empty"; return NULL; } if (list_length(parsedline->radiussecrets) < 1) { - ereport(LOG, + ereport(elevel, (errcode(ERRCODE_CONFIG_FILE_ERROR), errmsg("list of RADIUS secrets cannot be empty"), errcontext("line %d of configuration file \"%s\"", line_num, HbaFileName))); + *err_msg = "list of RADIUS secrets cannot be empty"; return NULL; } @@ -1630,24 +1630,53 @@ parse_hba_line(TokenizedLine *tok_line, int elevel) * but that's already checked above), 1 (use the same value * everywhere) or the same as the number of servers. */ - if (!verify_option_list_length(parsedline->radiussecrets, - "RADIUS secrets", - parsedline->radiusservers, - "RADIUS servers", - line_num)) + if (!(list_length(parsedline->radiussecrets) == 1 || + list_length(parsedline->radiussecrets) == list_length(parsedline->radiusservers))) + { + ereport(elevel, + (errcode(ERRCODE_CONFIG_FILE_ERROR), +errmsg("the number of RADIUS secrets (%d) must be 1 or the same as the number of RADIUS servers (%d)", + list_length(parsedline->radiussecrets), + list_length(parsedline->radiusservers)), +errcontext("line %d of configuration file \"%s\"", + line_num, HbaFileName))); + *err_msg = psprintf("the numbe
Re: Skip partition tuple routing with constant partition key
Hou-san, On Thu, May 27, 2021 at 3:56 PM houzj.f...@fujitsu.com wrote: > From: Amit Langote > Sent: Thursday, May 27, 2021 1:54 PM > > On Thu, May 27, 2021 at 11:47 AM houzj.f...@fujitsu.com > > wrote: > > > About teaching relcache about caching the target partition. > > > > > > David-san suggested cache the partidx in PartitionDesc. > > > And it will need looping and checking the cached value at each level. > > > I was thinking can we cache a partidx list[1, 2 ,3], and then we can > > > follow the list to get the last partition and do the partition CHECK > > > only for the last partition. If any unexpected thing happen, we can > > > return to the original table and redo the tuple routing without using the > > cached index. > > > What do you think ? > > > > Where are you thinking to cache the partidx list? Inside PartitionDesc or > > some > > executor struct? > > I was thinking cache the partidx list in PartitionDescData which is in > relcache, if possible, we can > use the cached partition between statements. Ah, okay. I thought you were talking about a different idea. How and where would you determine that a cached partidx value is indeed the correct one for a given row? Anyway, do you want to try writing a patch to see how it might work? -- Amit Langote EDB: http://www.enterprisedb.com
Re: Parallel Inserts in CREATE TABLE AS
On Thu, May 27, 2021 at 10:27 AM Dilip Kumar wrote: > > On Thu, May 27, 2021 at 10:16 AM Bharath Rupireddy > wrote: > > > > On Thu, May 27, 2021 at 7:12 AM houzj.f...@fujitsu.com > > wrote: > > > I am afraid that the using the FSM seems not get a stable performance > > > gain(at least on my machine), > > > I will take a deep look into this to figure out the difference. A naive > > > idea it that the benefit that bulk extension > > > bring is not much greater than the cost in FSM. > > > Do you have some ideas on it ? > > > > I think, if we try what Amit and I said in [1], we should get some > > insights on whether the bulk relation extension is taking more time or > > the FSM lookup. I plan to share the testing patch adding the timings > > and the counters so that you can also test from your end. I hope > > that's fine with you. > > I think some other cause of contention on relation extension locks are > 1. CTAS is using a buffer strategy and due to that, it might need to > evict out the buffer frequently for getting the new block in. Maybe > we can identify by turning off the buffer strategy for CTAS and > increasing the shared buffer so that data fits in memory. > One more thing to ensure is whether all the workers are using the same access strategy? -- With Regards, Amit Kapila.
Re: storing an explicit nonce
Greetings,

On Thu, May 27, 2021 at 4:52 PM Bruce Momjian wrote:
> > I am confused why checksums, which are widely used, acceptably require
> > wal_log_hints, but there is concern that file encryption, which is
> > heavier, cannot acceptably require wal_log_hints. I must be missing
> > something.
> >
> > Why can't checksums also throw away hint bit changes like you want to
> > do for file encryption and not require wal_log_hints?
>
> I'm really confused about it, too.

I read the above communication; I'm not sure if my understanding is
correct... What we are facing is not only changes of flags such as
*pd_flags*, but also others, like the pointer array changes in btree that
Robert mentioned. We don't need them to write a WAL record.

I have an immature idea: could we use LSN+blkno+checksum as the nonce when
checksums are enabled? And when checksums are disabled, we could just use
a global counter to generate a number as a fake checksum value... Then we
would likewise use LSN+blkno+fake_checksum as the nonce. Is there anything
wrong with that?

--
There is no royal road to learning.
HighGo Software Co.
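To make the idea concrete, here is a small sketch of packing the LSN, block
number, and checksum (or a counter standing in for the checksum when data
checksums are disabled) into a fixed-width nonce. This is purely
illustrative pseudocode for the proposal above, not PostgreSQL's actual TDE
design; every name in it is invented.

```python
# Illustrative sketch of the LSN+blkno+checksum nonce idea (not real
# PostgreSQL code).  The nonce is 16 bytes: 8-byte LSN, 4-byte block
# number, 2-byte checksum, 2 bytes of zero padding.
import struct

def page_nonce(lsn: int, blkno: int, checksum: int) -> bytes:
    """Pack LSN, block number and checksum into a 16-byte nonce."""
    return struct.pack(">QIH", lsn, blkno, checksum) + b"\x00\x00"

_counter = 0

def fake_checksum() -> int:
    """Stand-in 'global counter' for clusters with checksums disabled."""
    global _counter
    _counter = (_counter + 1) & 0xFFFF
    return _counter
```

The point of including blkno is that two pages can carry the same LSN after
a multi-page WAL record, and the block number disambiguates them.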
Re: Parallel Inserts in CREATE TABLE AS
On Thu, May 27, 2021 at 2:26 PM Amit Kapila wrote: > > I think some other cause of contention on relation extension locks are > > 1. CTAS is using a buffer strategy and due to that, it might need to > > evict out the buffer frequently for getting the new block in. Maybe > > we can identify by turning off the buffer strategy for CTAS and > > increasing the shared buffer so that data fits in memory. > > > > One more thing to ensure is whether all the workers are using the same > access strategy? In the Parallel Inserts in CTAS patches, the leader and each worker uses its own ring buffer of 16MB i.e. does myState->bistate = GetBulkInsertState(); separately. With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
Re: Skipping logical replication transactions on subscriber side
On Thu, May 27, 2021 at 1:46 PM Masahiko Sawada wrote:
>
> On Thu, May 27, 2021 at 2:48 PM Amit Kapila wrote:
> >
> > Okay, that makes sense but still not sure how will you identify if we
> > need to reset XID in case of failure doing that in the previous
> > attempt.
>
> It's just an idea, but we can record the failed transaction with its
> XID as well as its commit LSN. The sequence I'm thinking is,
>
> 1. the worker records the XID and commit LSN of the failed transaction
> to a catalog.

When will you record this info? I am not sure if we can try to update
this when an error has occurred. We can think of using try..catch in the
apply worker and then record it in the catch block on error, but would
that be advisable? One random thought that occurred to me is that the
apply worker could notify such information to the launcher (or maybe
another process), which would log this information.

> 2. the user specifies how to resolve that conflict transaction
> (currently only 'skip' is supported) and writes to the catalog.
> 3. the worker does the resolution method according to the catalog. If
> the worker didn't start to apply those changes, it can skip the entire
> transaction. If it did, it rolls back the transaction and ignores the
> remaining changes.
>
> The worker needs neither to reset information of the last failed
> transaction nor to mark the conflicted transaction as resolved. The
> worker will ignore that information when checking the catalog if the
> commit LSN is passed.

So won't this require us to check the required info in the catalog
before applying each transaction? If so, that might be overhead; maybe
we can build some cache of the highest commit LSN that can be consulted
rather than the catalog table. I think we also need to think about when
to remove rows for which the conflict has been resolved, as we can't let
that information grow infinitely.

> > Also, I am thinking that instead of a stat view, do we need
> > to consider having a system table (pg_replication_conflicts or
> > something like that) for this, because what if the stats information
> > is lost (say either due to a crash or due to UDP packet loss)? Can we
> > rely on a stats view for this?
>
> Yeah, it seems better to use a catalog.

Okay.

--
With Regards,
Amit Kapila.
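For reference, the user-facing commands proposed earlier in this thread
would look roughly like the following; this is the syntax under discussion,
not an implemented feature:

```sql
-- Sketch of the skip-transaction syntax proposed upthread.
ALTER SUBSCRIPTION test_sub SET SKIP TRANSACTION 590;  -- skip XID 590
ALTER SUBSCRIPTION test_sub RESET SKIP TRANSACTION;    -- clear it again
```

The open questions above are about what happens between these two commands:
where the XID is recorded, and who resets it once the transaction has been
skipped.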
Re: Skipping logical replication transactions on subscriber side
On Thu, May 27, 2021 at 12:01 PM Masahiko Sawada wrote: > > On Wed, May 26, 2021 at 6:11 PM Amit Kapila wrote: > > > > I agree with you that specifying XID could be easier and > > understandable for users. I was thinking and studying a bit about what > > other systems do in this regard. Why don't we try to provide conflict > > resolution methods for users? The idea could be that either the > > conflicts can be resolved automatically or manually. In the case of > > manual resolution, users can use the existing methods or the XID stuff > > you are proposing here and in case of automatic resolution, the > > in-built or corresponding user-defined functions will be invoked for > > conflict resolution. There are more details to figure out in the > > automatic resolution scheme but I see a lot of value in doing the > > same. > > Yeah, I also see a lot of value in automatic conflict resolution. But > maybe we can have both ways? For example, in case where the user wants > to resolve conflicts in different ways or a conflict that cannot be > resolved by automatic resolution (not sure there is in practice > though), the manual resolution would also have value. > Right, that is exactly what I was saying. So, even if both can be done as separate patches, we should try to design the manual resolution in a way that can be extended for an automatic resolution system. I think we can try to have some initial idea/design/POC for an automatic resolution as well to ensure that the manual resolution scheme can be further extended. -- With Regards, Amit Kapila.
Re: logical replication empty transactions
On Tue, May 25, 2021 at 6:36 PM Ajin Cherian wrote:
>
> On Tue, Apr 27, 2021 at 1:49 PM Ajin Cherian wrote:
>
> Rebased the patch as it was no longer applying.

Thanks for the updated patch. A few comments:

1) I'm not sure if we could add some tests for skipping empty
transactions; if possible, add a few tests.

2) We could add some debug-level log messages for the transactions that
will be skipped.

3) You could keep this variable below the other bool variables in the
structure:

+ bool sent_begin_txn; /* flag indicating whether begin
+  * has already been sent */
+

4) You can split this comment across multiple lines as it exceeds 80
characters:

+ /* output BEGIN if we haven't yet, avoid for streaming and
non-transactional messages */
+ if (!data->sent_begin_txn && !in_streaming && transactional)
+ pgoutput_begin(ctx, txn);

Regards,
Vignesh
Re: sync request forward function ForwardSyncRequest() might hang for some time in a corner case?
On Tue, May 25, 2021 at 4:39 PM Paul Guo wrote:
>
> Hi hackers,
>
> I found this when reading the related code. Here is the scenario:
>
> bool
> RegisterSyncRequest(const FileTag *ftag, SyncRequestType type,
>                     bool retryOnError)
>
> For the case where retryOnError is true, the function loops calling
> ForwardSyncRequest() until it succeeds. But if we run into the below
> branch in ForwardSyncRequest(), RegisterSyncRequest() will need to loop
> until the checkpointer absorbs the existing requests, so it might hang
> for some time until a checkpoint request is triggered. This scenario
> seems to be possible in theory, though the chance is not high.

It seems like a really unlikely scenario, but maybe possible if you use a
lot of unlogged tables (as you could eventually dirty/evict more than
NBuffers buffers without triggering enough WAL activity to trigger a
checkpoint with any sane checkpoint configuration).

> ForwardSyncRequest():
>
> if (CheckpointerShmem->checkpointer_pid == 0 ||
>     (CheckpointerShmem->num_requests >= CheckpointerShmem->max_requests &&
>      !CompactCheckpointerRequestQueue()))
> {
>     /*
>      * Count the subset of writes where backends have to do their own
>      * fsync
>      */
>     if (!AmBackgroundWriterProcess())
>         CheckpointerShmem->num_backend_fsync++;
>     LWLockRelease(CheckpointerCommLock);
>     return false;
> }
>
> One fix is to add code like the below in RegisterSyncRequest(), to wake
> the checkpointer in this scenario:
>
> // checkpointer_triggered: variable for one trigger only.
> if (!ret && retryOnError && ProcGlobal->checkpointerLatch &&
>     !checkpointer_triggered)
>     SetLatch(ProcGlobal->checkpointerLatch);
>
> Any comments?

It looks like you intended to set the checkpointer_triggered variable but
didn't. Also, this will wake up the checkpointer but won't force a
checkpoint (unlike RequestCheckpoint()). That may be a good thing though,
as it would only absorb the requests and go back to sleep if no other
threshold is reached.
Apart from the implementation details it seems like it could help in this unlikely event.
RE: Parallel Inserts in CREATE TABLE AS
From: Bharath Rupireddy
Sent: Thursday, May 27, 2021 3:41 PM
> Also, the number of pages the table occupies in each case, along with
> the table size, would give more insights.
>
> I do as follows to get the number of pages a relation occupies:
> CREATE EXTENSION pgstattuple;
> SELECT pg_relpages('test');

It seems the difference between SKIP FSM and NOT SKIP FSM is not big. I
tried several times and the average result is almost the same.

pg_relpages      - 1428575
pg_relation_size - 11702976512 (11GB)

Best regards,
houzj
security_definer_search_path GUC
Hi, Since writing SECURITY DEFINER functions securely requires annoying incantations[1], wouldn't it be nice if we provided a way for the superuser to override the default search path via a GUC in postgresql.conf? That way you can set search_path if you want to override the default, but if you leave it out you're not vulnerable, assuming security_definer_search_path only contains secure schemas. .m
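For context, the incantation in question is the per-function search_path
pin that the CREATE FUNCTION documentation recommends today. A minimal
example of the current pattern (the `pwds` table is illustrative):

```sql
-- The mitigation required today: pin search_path on every SECURITY
-- DEFINER function so that objects resolve only from trusted schemas.
CREATE FUNCTION check_password(uname text, pass text) RETURNS boolean
LANGUAGE sql SECURITY DEFINER
-- without this clause, the *caller's* search_path applies inside the body
SET search_path = pg_catalog, pg_temp
AS $$
    SELECT EXISTS (SELECT 1 FROM pwds WHERE username = $1 AND pwd = $2);
$$;
```

A cluster-wide `security_definer_search_path` GUC would make the SET clause
optional rather than something every function author must remember.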
Re: Add PortalDrop in exec_execute_message
Alvaro Herrera wrote on 2021-05-26 23:59:
> On 2021-May-25, Yura Sokolov wrote:
> > Tom Lane wrote on 2021-05-21 21:23:
> > > Yura Sokolov writes:
> > > > I propose to add PortalDrop at the 'if (completed)' branch of
> > > > exec_execute_message.
> > >
> > > This violates our wire protocol specification, which specifically
> > > says
> > >
> > >     If successfully created, a named portal object lasts till the
> > >     end of the current transaction, unless explicitly destroyed. An
> > >     unnamed portal is destroyed at the end of the transaction, or as
> > >     soon as the next Bind statement specifying the unnamed portal as
> > >     destination is issued. (Note that a simple Query message also
> > >     destroys the unnamed portal.)
> > >
> > > I'm inclined to think that your complaint would be better handled
> > > by having the client send a portal-close command, if it's not
> > > going to do something else immediately.
> >
> > I thought about it as well. Then, if I understand correctly,
> > PQsendQueryGuts and PQsendQueryInternal in pipeline mode should send a
> > "close portal" (CP) message after the "execute" message, right?
>
> I don't think they should do that. The portal remains open, and the
> libpq interface does that. The portal gets closed at end of transaction
> without the need for any client message. I think if the client wanted
> to close the portal ahead of time, it would need a new libpq entry
> point to send the message to do that.

- PQsendQuery issues a Query message, and exec_simple_query closes its
  portal.
- People don't expect PQsendQueryParams to be different from PQsendQuery
  aside from parameter sending. The fact that the portal remains open is
  a significant, unexpected and undesired difference.
- PQsendQueryGuts is used in PQsendQueryParams and PQsendQueryPrepared.
  It always sends an empty portal name and always a "send me all rows"
  limit (zero). Both usages are certainly meant to just perform a query,
  and certainly no one expects the portal to remain open.
> (I didn't add a Close Portal message to PQsendQueryInternal in pipeline
> mode precisely because there is no such message in PQsendQueryGuts.

But PQsendQueryInternal should replicate the behavior of PQsendQuery,
not of PQsendQueryParams. Although it has to use the new protocol, it
should be indistinguishable to the user, and therefore the portal
should be closed.

> I think it would be wrong to unconditionally add a Close Portal message
> to any of those places.)

Why? If you foresee problems, please share your mind.

regards,
Sokolov Yura aka funny_falcon
Re: Add PortalDrop in exec_execute_message
Tom Lane wrote 2021-05-27 00:19:
> Alvaro Herrera writes:
>> (I didn't add a Close Portal message to PQsendQueryInternal in pipeline
>> mode precisely because there is no such message in PQsendQueryGuts.
>> I think it would be wrong to unconditionally add a Close Portal message
>> to any of those places.)
>
> Yeah, I'm not very comfortable with having libpq take it on itself to
> do that, either.

But...

Tom Lane wrote 2021-05-21 21:23:
> I'm inclined to think that your complaint would be better handled
> by having the client send a portal-close command, if it's not
> going to do something else immediately.

And given that PQsendQueryParams should not be different from
PQsendQuery (aside from parameter sending), why shouldn't it close its
portal immediately, as happens in exec_simple_query? I really doubt a
user of PQsendQueryPrepared is aware of the portal either, since it is
also unnamed and also exhausted (because PQsendQueryGuts always sends a
"send me all rows" limit). And why should PQsendQueryInternal behave
differently in pipelined and non-pipelined modes? It closes the portal
in non-pipelined mode, but will not close the portal of the last query
in pipelined mode (inside a transaction).

> Looking back at the original complaint, it seems like it'd be fair to
> wonder why we're still holding a page pin in a supposedly completed
> executor run. Maybe the right fix is somewhere in the executor scan
> logic.

Perhaps because the query is simple and the portal is created as
seekable?

> regards, tom lane

regards,
Yura Sokolov
Re: Move pg_attribute.attcompression to earlier in struct for reduced size?
On Thu, May 27, 2021 at 12:11 AM Tom Lane wrote: > AFAIR, there are zero promises about how effective, or when effective, > changes in SET STORAGE will be. And the number of complaints about > that has also been zero. So I'm not sure why we need to do more for > SET COMPRESSION. Especially since I'm unconvinced that recompressing > everything just to recompress everything would *ever* be worthwhile. I think it is good to have *some* way of ensuring that what you want the system to do, it is actually doing. If we have not a single operation in the system anywhere that can force recompression, someone who actually cares will be left with no option but a dump and reload. That is probably both a whole lot slower than something in the server itself and also a pretty silly thing to have to tell people to do. If it helps, I'd be perfectly fine with having this be an *optional* behavior for CLUSTER or VACUUM FULL, depending on some switch. Or we can devise another way for the user to make it happen. But we shouldn't just be setting a policy that users are not allowed to care whether their data is actually compressed using the compression method they specified. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Bracket, brace, parenthesis
On Thu, May 27, 2021 at 03:20:10PM +0900, Kyotaro Horiguchi wrote: > At Fri, 14 May 2021 10:04:57 -0400, Tom Lane wrote in >> +1. I tend to write "square bracket" or "curly brace" when I want to >> be extra clear, but I think the bare terms are widely understood to >> have those meanings. > > Thanks! I think the message is new in 14 so we can fix it right > away. The attached is the version with a commit message added. No objections from me to fix that on HEAD now for clarity, let's wait a bit and see if others have more comments. You have missed an update of multirangetypes.out, by the way. -- Michael signature.asc Description: PGP signature
Re: Move pg_attribute.attcompression to earlier in struct for reduced size?
Robert Haas writes: > On Thu, May 27, 2021 at 12:11 AM Tom Lane wrote: >> AFAIR, there are zero promises about how effective, or when effective, >> changes in SET STORAGE will be. And the number of complaints about >> that has also been zero. So I'm not sure why we need to do more for >> SET COMPRESSION. Especially since I'm unconvinced that recompressing >> everything just to recompress everything would *ever* be worthwhile. > I think it is good to have *some* way of ensuring that what you want > the system to do, it is actually doing. If we have not a single > operation in the system anywhere that can force recompression, someone > who actually cares will be left with no option but a dump and reload. > That is probably both a whole lot slower than something in the server > itself and also a pretty silly thing to have to tell people to do. [ shrug... ] I think the history of the SET STORAGE option teaches us that there is no such requirement, and you're inventing a scenario that doesn't exist in the real world. regards, tom lane
Re: pg_rewind fails if there is a read only file.
On Wed, May 26, 2021 at 10:32 PM Andrew Dunstan wrote:
>
> On 5/26/21 2:43 AM, Laurenz Albe wrote:
> > On Wed, 2021-05-26 at 08:57 +0900, Michael Paquier wrote:
> >> On Tue, May 25, 2021 at 03:17:37PM -0400, Andrew Dunstan wrote:
> >>> If we do decide to do something the question arises what should it do?
> >>> If we're to allow for it I'm wondering if the best thing would be simply
> >>> to ignore such a file.
> >> Enforcing an assumption that any file could be read-only is a very bad
> >> idea, as that could lead to weird behaviors if a FS turns read-only
> >> suddenly while doing a rewind. Another idea that has popped out across
> >> the years was to add an option to pg_rewind so that users could filter
> >> files manually. That could easily be dangerous in the wrong hands,
> >> though, as one could think that it is a good idea to skip a control
> >> file, for example.
> >>
> >> The thing is that here we actually know the set of files we'd like to
> >> ignore most of the time, and we still want to have some automated
> >> control over what gets filtered. So here is a new idea: we build a list
> >> of files based on a set of GUC parameters using postgres -C on the
> >> target data folder, and assume that these are safe enough to be skipped
> >> all the time, if these are in the data folder.
> > That sounds complicated, but should work.
> > There should be a code comment somewhere that warns people not to forget
> > to look at that when they add a new GUC.
> >
> > I can think of two alternatives to handle this:
> >
> > - Skip files that cannot be opened for writing and issue a warning.
> >   That is simple, but coarse.
> >   A slightly more sophisticated version would first check if files
> >   are the same on both machines and skip the warning for those.
> >
> > - Paul's idea to try and change the mode on the read-only file
> >   and reset it to the original state after pg_rewind is done.
> How about we only skip (with a warning) read-only files if they are in
> the data root, not a subdirectory? That would save us from silently
> ignoring read-only filesystems and not involve using a GUC.

That assumption seems to limit the user scenario.

How about this: by default, change and reset the file mode before and
after open() for read-only files, but also allow passing skip-file names
to pg_rewind (e.g. pg_rewind --skip filenameN or --skip-list-file
listfile) in case users really want to skip some files (probably not
just read-only files). Presumably the people who run pg_rewind are DBAs
or admins with basic knowledge of this, so we should not worry too much
about users skipping important files (we could even set a deny list for
this). Also, this solution works fine for a read-only file system, since
pg_rewind quits as soon as it fails to add write permission to any
read-only file.

-- Paul Guo (Vmware)
Re: sync request forward function ForwardSyncRequest() might hang for some time in a corner case?
On Thu, May 27, 2021 at 7:11 PM Julien Rouhaud wrote: > > On Tue, May 25, 2021 at 4:39 PM Paul Guo wrote: > > > > Hi hackers, > > > > I found this when reading the related code. Here is the scenario: > > > > bool > > RegisterSyncRequest(const FileTag *ftag, SyncRequestType type, > > bool retryOnError) > > > > For the case retryOnError is true, the function would in loop call > > ForwardSyncRequest() until it succeeds, but in ForwardSyncRequest(), > > we can see if we run into the below branch, RegisterSyncRequest() will > > need to loop until the checkpointer absorbs the existing requests so > > ForwardSyncRequest() might hang for some time until a checkpoint > > request is triggered. This scenario seems to be possible in theory > > though the chance is not high. > > It seems like a really unlikely scenario, but maybe possible if you > use a lot of unlogged tables maybe (as you could eventually > dirty/evict more than NBuffers buffers without triggering enough WALs > activity to trigger a checkpoint with any sane checkpoint > configuration). RegisterSyncRequest() handles SYNC_UNLINK_REQUEST and SYNC_FORGET_REQUEST scenarios, besides the usual SYNC_REQUEST type for buffer sync. > > ForwardSyncRequest(): > > > > if (CheckpointerShmem->checkpointer_pid == 0 || > > (CheckpointerShmem->num_requests >= CheckpointerShmem->max_requests > > && > > !CompactCheckpointerRequestQueue())) > > { > > /* > > * Count the subset of writes where backends have to do their own > > * fsync > > */ > > if (!AmBackgroundWriterProcess()) > > CheckpointerShmem->num_backend_fsync++; > > LWLockRelease(CheckpointerCommLock); > > return false; > > } > > > > One fix is to add below similar code in RegisterSyncRequest(), trigger > > a checkpoint for the scenario. > > > > // checkpointer_triggered: variable for one trigger only. > > if (!ret && retryOnError && ProcGlobal->checkpointerLatch && > > !checkpointer_triggered) > > SetLatch(ProcGlobal->checkpointerLatch); > > > > Any comments? 
> It looks like you intended to set the checkpointer_triggered var but
> didn't.

Yes, this is just pseudocode.

> Also this will wake up the checkpointer but won't force a
> checkpoint (unlike RequestCheckpoint()). It may be a good thing

I do not expect an immediate checkpoint. AbsorbSyncRequests() is
enough, since after that RegisterSyncRequest() could finish.

> though as it would only absorb the requests and go back to sleep if no
> other threshold is reached. Apart from the implementation details it
> seems like it could help in this unlikely event.

-- Paul Guo (Vmware)
Re: sync request forward function ForwardSyncRequest() might hang for some time in a corner case?
On Thu, May 27, 2021 at 9:59 PM Paul Guo wrote: > > On Thu, May 27, 2021 at 7:11 PM Julien Rouhaud wrote: > > > > On Tue, May 25, 2021 at 4:39 PM Paul Guo wrote: > > > > > > Hi hackers, > > > > > > I found this when reading the related code. Here is the scenario: > > > > > > bool > > > RegisterSyncRequest(const FileTag *ftag, SyncRequestType type, > > > bool retryOnError) > > > > > > For the case retryOnError is true, the function would in loop call > > > ForwardSyncRequest() until it succeeds, but in ForwardSyncRequest(), > > > we can see if we run into the below branch, RegisterSyncRequest() will > > > need to loop until the checkpointer absorbs the existing requests so > > > ForwardSyncRequest() might hang for some time until a checkpoint > > > request is triggered. This scenario seems to be possible in theory > > > though the chance is not high. > > > > It seems like a really unlikely scenario, but maybe possible if you > > use a lot of unlogged tables maybe (as you could eventually > > dirty/evict more than NBuffers buffers without triggering enough WALs > > activity to trigger a checkpoint with any sane checkpoint > > configuration). > > RegisterSyncRequest() handles SYNC_UNLINK_REQUEST and > SYNC_FORGET_REQUEST scenarios, besides the usual SYNC_REQUEST type for > buffer sync. > > > > ForwardSyncRequest(): > > > > > > if (CheckpointerShmem->checkpointer_pid == 0 || > > > (CheckpointerShmem->num_requests >= > > > CheckpointerShmem->max_requests && > > > !CompactCheckpointerRequestQueue())) > > > { > > > /* > > > * Count the subset of writes where backends have to do their own > > > * fsync > > > */ > > > if (!AmBackgroundWriterProcess()) > > > CheckpointerShmem->num_backend_fsync++; > > > LWLockRelease(CheckpointerCommLock); > > > return false; > > > } > > > > > > One fix is to add below similar code in RegisterSyncRequest(), trigger > > > a checkpoint for the scenario. > > > > > > // checkpointer_triggered: variable for one trigger only. 
> > > if (!ret && retryOnError && ProcGlobal->checkpointerLatch &&
> > >     !checkpointer_triggered)
> > >     SetLatch(ProcGlobal->checkpointerLatch);
> > >
> > > Any comments?
> >
> > It looks like you intended to set the checkpointer_triggered var but
> > didn't.
>
> Yes, this is just pseudocode.
>
> > Also this will wake up the checkpointer but won't force a
> > checkpoint (unlike RequestCheckpoint()). It may be a good thing
>
> I do not expect an immediate checkpoint. AbsorbSyncRequests()
> is enough, since after that RegisterSyncRequest() could finish.
>
> > though as it would only absorb the requests and go back to sleep if no
> > other threshold is reached. Apart from the implementation details it
> > seems like it could help in this unlikely event.

Also note that ForwardSyncRequest() does wake up the checkpointer if it
thinks the request queue in shared memory is "too full", but does not
wake it up when the queue is actually full (if the queue is full and
compaction fails, the function returns false before reaching the code
below). This does not seem reasonable. See this code in
ForwardSyncRequest():

    /* If queue is more than half full, nudge the checkpointer to empty it */
    too_full = (CheckpointerShmem->num_requests >=
                CheckpointerShmem->max_requests / 2);

    /* ... but not till after we release the lock */
    if (too_full && ProcGlobal->checkpointerLatch)
        SetLatch(ProcGlobal->checkpointerLatch);

-- Paul Guo (Vmware)
Re: Parallel Inserts in CREATE TABLE AS
On Thu, May 27, 2021 at 9:53 AM Bharath Rupireddy wrote:
> > One idea to find this out could be that we have three counters for
> > each worker which counts the number of times each worker extended the
> > relation in bulk, the number of times each worker extended the
> > relation by one block, the number of times each worker gets the page
> > from FSM. It might be possible that with this we will be able to
> > figure out why there is a difference between your and Hou-San's
> > results.
>
> Yeah, that helps. And also, the time spent in
> LockRelationForExtension, ConditionalLockRelationForExtension,
> GetPageWithFreeSpace and RelationAddExtraBlocks too can give some
> insight.
>
> My plan is to have a patch with above info added in (which I will
> share it here so that others can test and see the results too) and run
> the "case 4" where there's a regression seen on my system.

I captured the below information with the attached patch
0001-test-times-and-block-counts.patch applied on top of the CTAS v23
patch set. Testing details are attached in the file named "test".

Total time spent in LockRelationForExtension
Total time spent in GetPageWithFreeSpace
Total time spent in RelationAddExtraBlocks
Total number of times extended the relation in bulk
Total number of times extended the relation by one block
Total number of blocks added in bulk extension
Total number of times getting the page from FSM

Here is a summary of what I observed:

1) The execution time with 2 workers, without TABLE_INSERT_SKIP_FSM
(140 sec) is more than with 0 workers (112 sec).
2) The execution time with 2 workers, with TABLE_INSERT_SKIP_FSM (225
sec) is more than with 2 workers, without TABLE_INSERT_SKIP_FSM (140
sec).
3) The majority of the time is going into waiting for the relation
extension lock in LockRelationForExtension.
With 2 workers, without TABLE_INSERT_SKIP_FSM, out of the total
execution time of 140 sec, the time spent in LockRelationForExtension
is ~40 sec and the time spent in RelationAddExtraBlocks is ~20 sec. So,
~60 sec are being spent in these two functions.

With 2 workers, with TABLE_INSERT_SKIP_FSM, out of the total execution
time of 225 sec, the time spent in LockRelationForExtension is ~135 sec
and the time spent in RelationAddExtraBlocks is 0 sec (because we skip
the FSM, no bulk-extend logic applies). So, most of the time is being
spent in LockRelationForExtension.

I'm still not sure why the execution time with 0 workers (serial
execution, no parallelism involved) on my testing system is 112 sec,
compared to 58 sec on Hou-San's system for the same use case. Maybe the
testing system I'm using is not of the latest configuration compared to
the others.

Having said that, I request others to try and see whether the same
observations (as above) are made on their testing systems for the same
use case. If others don't see a regression (with just 2 workers), or
observe little difference with and without TABLE_INSERT_SKIP_FSM, I'm
open to changing the parallel inserts in CTAS code to use
TABLE_INSERT_SKIP_FSM.

In any case, if the observation is that a good amount of time is being
spent in LockRelationForExtension, I'm not sure (at this point) whether
we can do something about it or just live with it.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

From 105495d5aa6b25f61c812dd5cb9ff692abe38373 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy
Date: Thu, 27 May 2021 13:37:32 +0530
Subject: [PATCH] test times and block counts

---
 src/backend/access/heap/hio.c             | 21
 src/backend/commands/createas.c           | 41 +++
 src/backend/storage/freespace/freespace.c | 22 +++-
 src/backend/storage/lmgr/lmgr.c           | 13 +++
 src/include/commands/createas.h           | 17 ++
 5 files changed, 113 insertions(+), 1 deletion(-)

diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index d34edb4190..d737c2d763 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -19,6 +19,8 @@
 #include "access/hio.h"
 #include "access/htup_details.h"
 #include "access/visibilitymap.h"
+#include "commands/createas.h"
+#include "portability/instr_time.h"
 #include "storage/bufmgr.h"
 #include "storage/freespace.h"
 #include "storage/lmgr.h"
@@ -198,12 +200,17 @@ RelationAddExtraBlocks(Relation relation, BulkInsertState bistate)
 	firstBlock = InvalidBlockNumber;
 	int			extraBlocks;
 	int			lockWaiters;
+	instr_time	start;
+	instr_time	end;

 	/* Use the length of the lock wait queue to judge how much to extend. */
 	lockWaiters = RelationExtensionLockWaiterCount(relation);
 	if (lockWaiters <= 0)
 		return;

+	if (is_ctas)
+		INSTR_TIME_SET_CURRENT(start);
+
 	/*
 	 * It might seem like multiplying the number of lock waiters by as much as
 	 * 20 is too aggressive, but benchmarking revealed that smaller numbers
@@ -212,6 +219,9 @@ RelationAddExtraBlocks(Relation relation, BulkInsertState bistate)
 	 */
 	extraBlocks = Min(512, lockWaite
Re: Move pg_attribute.attcompression to earlier in struct for reduced size?
On Thu, May 27, 2021 at 7:04 PM Tom Lane wrote: > > Robert Haas writes: > > On Thu, May 27, 2021 at 12:11 AM Tom Lane wrote: > >> AFAIR, there are zero promises about how effective, or when effective, > >> changes in SET STORAGE will be. And the number of complaints about > >> that has also been zero. So I'm not sure why we need to do more for > >> SET COMPRESSION. Especially since I'm unconvinced that recompressing > >> everything just to recompress everything would *ever* be worthwhile. > > > I think it is good to have *some* way of ensuring that what you want > > the system to do, it is actually doing. If we have not a single > > operation in the system anywhere that can force recompression, someone > > who actually cares will be left with no option but a dump and reload. > > That is probably both a whole lot slower than something in the server > > itself and also a pretty silly thing to have to tell people to do. > > [ shrug... ] I think the history of the SET STORAGE option teaches us > that there is no such requirement, and you're inventing a scenario that > doesn't exist in the real world. But can we compare SET STORAGE with SET compression? I mean storage just controls how the data are stored internally and there is no external dependency. But if we see the compression it will have a dependency on the external library. So if the user wants to get rid of the dependency on the external library then IMHO, there should be some way to do it by recompressing all the data. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: sync request forward function ForwardSyncRequest() might hang for some time in a corner case?
On Thu, May 27, 2021 at 9:59 PM Paul Guo wrote: > > > It seems like a really unlikely scenario, but maybe possible if you > > use a lot of unlogged tables maybe (as you could eventually > > dirty/evict more than NBuffers buffers without triggering enough WALs > > activity to trigger a checkpoint with any sane checkpoint > > configuration). > > RegisterSyncRequest() handles SYNC_UNLINK_REQUEST and > SYNC_FORGET_REQUEST scenarios, besides the usual SYNC_REQUEST type for > buffer sync. I know, but the checkpointer can hold up to NBuffers requests, so I highly doubt that you can end up filling the buffer with those. > > > ForwardSyncRequest(): > > > > > > if (CheckpointerShmem->checkpointer_pid == 0 || > > > (CheckpointerShmem->num_requests >= > > > CheckpointerShmem->max_requests && > > > !CompactCheckpointerRequestQueue())) > > > { > > > /* > > > * Count the subset of writes where backends have to do their own > > > * fsync > > > */ > > > if (!AmBackgroundWriterProcess()) > > > CheckpointerShmem->num_backend_fsync++; > > > LWLockRelease(CheckpointerCommLock); > > > return false; > > > } > > > > > > One fix is to add below similar code in RegisterSyncRequest(), trigger > > > a checkpoint for the scenario. > > > > > > // checkpointer_triggered: variable for one trigger only. > > > if (!ret && retryOnError && ProcGlobal->checkpointerLatch && > > > !checkpointer_triggered) > > > SetLatch(ProcGlobal->checkpointerLatch); > > > > > > Any comments? > > > > It looks like you intended to set the checkpointer_triggered var but > > Yes this is just pseduo code. > > > didn't. Also this will wake up the checkpointer but won't force a > > checkpoint (unlike RequestCheckpoint()). It may be a good thing > > I do not expect an immediate checkpoint. AbsorbSyncRequests() > is enough since after that RegisterSyncRequest() could finish. 
You said "trigger a checkpoint", which sounded more like forcing a
checkpoint rather than waking up the checkpointer so that it can absorb
the pending requests, so it seemed worth mentioning what it would
really do.
Re: sync request forward function ForwardSyncRequest() might hang for some time in a corner case?
> You said "trigger a checkpoint", which sounded more like forcing a > checkpointer rather than waking up the checkpointer so that it can > absorb the pending requests, so it seems worth to mention what it > would really do. Yes it is not accurate. Thanks for the clarification. -- Paul Guo (Vmware)
Re: sync request forward function ForwardSyncRequest() might hang for some time in a corner case?
On Thu, May 27, 2021 at 10:05 PM Paul Guo wrote:
> > Also note that ForwardSyncRequest() does wake up the checkpointer if
> it thinks the requests in shared memory are "too full", but does not
> wake up when the request is actually full. This does not seem to be
> reasonable.
> See below code in ForwardSyncRequest
>
> /* If queue is more than half full, nudge the checkpointer to empty it */
> too_full = (CheckpointerShmem->num_requests >=
> CheckpointerShmem->max_requests / 2);
>
> /* ... but not till after we release the lock */
> if (too_full && ProcGlobal->checkpointerLatch)
> SetLatch(ProcGlobal->checkpointerLatch);

Ah indeed. Well, it means that the checkpointer is woken up early enough
to avoid reaching that point. I'm not sure that it's actually possible
to reach a point where the list is full and the checkpointer is sitting
idle.
Re: Move pg_attribute.attcompression to earlier in struct for reduced size?
On Thu, May 27, 2021 at 10:18 AM Dilip Kumar wrote: > > [ shrug... ] I think the history of the SET STORAGE option teaches us > > that there is no such requirement, and you're inventing a scenario that > > doesn't exist in the real world. > > But can we compare SET STORAGE with SET compression? I mean storage > just controls how the data are stored internally and there is no > external dependency. But if we see the compression it will have a > dependency on the external library. So if the user wants to get rid > of the dependency on the external library then IMHO, there should be > some way to do it by recompressing all the data. TBH, I'm more concerned about the other direction. Surely someone who upgrades from an existing release to v14 and sets their compression method to lz4 is going to want a way of actually converting their data to using lz4. To say that nobody cares about that is to deem the feature useless. Maybe that's what Tom thinks, but it's not what I think. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Speed up pg_checksums in cases where checksum already set
Thanks for the quick replies, everyone.

On Wed, May 26, 2021 at 10:17 PM Michael Paquier wrote:
>
> -if (do_sync)
> +if (do_sync && total_files_modified)
> Here, I am on the edge. It could be an advantage to force a flush of
> the data folder anyway, no?

I was originally on the fence about including this as well, but since
the database is shut down and already in a consistent state, there
seems to be no advantage to syncing if we have made no changes. Things
are no better or worse than when we arrived. However, the real-world
use case of running pg_checksums --enable and getting no changed blocks
is probably fairly rare, so if there is a strong objection, I'm happy
reverting to just (do_sync). (I'm not sure how cheap a sync is; I
assume it's low impact as the database is shut down, so I guess it
becomes a "might as well while we are here"?)

On Wed, May 26, 2021 at 10:29 PM Justin Pryzby wrote:
> In one of the checksum patches, there was an understanding that the pages
> should be written even if the checksum is correct, to handle replicas.
> ...
> Does your patch complicate things for the "stop all the clusters before
> switching them all" case?

I cannot imagine how it would, but, like Michael, I'm not really
understanding the reasoning here. We only run when safely shut down, so
no WAL or dirty buffers need concern us :). Of course, once the
postmaster is up and running, fiddling with checksums becomes vastly
more complicated, as evidenced by that thread. I'm happy sticking to
and speeding up the offline version for now.

Cheers,
Greg
Re: Reducing the range of OIDs consumed by genbki.pl
On Wed, May 26, 2021 at 5:43 PM Tom Lane wrote: > So I propose shoehorning this into v14, to avoid shipping a > release where FirstBootstrapObjectId has been bumped up. Just to repeat on this thread what I said on the other one, I am +1 on this as a concept. I have not reviewed the patch. Thanks, -- Robert Haas EDB: http://www.enterprisedb.com
Re: Move pg_attribute.attcompression to earlier in struct for reduced size?
Robert Haas writes: > On Thu, May 27, 2021 at 10:18 AM Dilip Kumar wrote: >>> [ shrug... ] I think the history of the SET STORAGE option teaches us >>> that there is no such requirement, and you're inventing a scenario that >>> doesn't exist in the real world. >> But can we compare SET STORAGE with SET compression? I mean storage >> just controls how the data are stored internally and there is no >> external dependency. But if we see the compression it will have a >> dependency on the external library. So if the user wants to get rid >> of the dependency on the external library then IMHO, there should be >> some way to do it by recompressing all the data. > TBH, I'm more concerned about the other direction. Surely someone who > upgrades from an existing release to v14 and sets their compression > method to lz4 is going to want a way of actually converting their data > to using lz4. To say that nobody cares about that is to deem the > feature useless. Maybe that's what Tom thinks, but it's not what I > think. What I'm hearing is a whole lot of hypothesizing and zero evidence of actual field requirements. On the other side of the coin, we've already wasted significant person-hours on fixing this feature's memory leakage, and now people are proposing to expend more effort on solving^Wpapering over its performance issues by adding yet more user-visible complication. It's already adding too much user-visible complication IMO --- I know because I was just copy-editing the documentation about that yesterday. I say it's time to stop the bleeding and rip it out. When and if there are actual field requests to have a way to do this, we can discuss what's the best way to respond to those requests. Hacking VACUUM probably isn't the best answer, anyway. But right now, we are past feature freeze, and I think we ought to jettison this one rather than quickly kluge something. regards, tom lane
Re: storing an explicit nonce
On Wed, May 26, 2021 at 4:40 PM Bruce Momjian wrote: > You are saying that by using a non-LSN nonce, you can write out the page > with a new nonce, but the same LSN, and also discard the page during > crash recovery and use the WAL copy? I don't know what "discard the page during crash recovery and use the WAL copy" means. > I am confused why checksums, which are widely used, acceptably require > wal_log_hints, but there is concern that file encryption, which is > heavier, cannot acceptably require wal_log_hints. I must be missing > something. I explained this in the first complete paragraph of my first email with this subject line: "For example, right now, we only need to WAL log hints for the first write to each page after a checkpoint, but in this approach, if the same page is written multiple times per checkpoint cycle, we'd need to log hints every time." That's a huge difference. Page eviction in some workloads can push the same pages out of shared buffers every few seconds, whereas something that has to be done once per checkpoint cycle cannot affect each page nearly so often. A checkpoint is only going to occur every 5 minutes by default, or more realistically every 10-15 minutes in a well-tuned production system. In other words, we're not holding up some kind of double standard, where the existing feature is allowed to depend on doing a certain thing but your feature isn't allowed to depend on the same thing. Your design depends on doing something which is potentially 100x+ more expensive than the existing thing. It's not always going to be that expensive, but it can be. > Why can't checksums also throw away hint bit changes like you want to do > for file encryption and not require wal_log_hints? Well, I don't want to throw away hint bit changes, just like we don't throw them away right now. And I want to do that by making sure that each time the page is written, we use a different nonce, but without the expense of having to advance the LSN. 
Now, another option is to do what you suggest here. We could say that if a dirty page is evicted, but the page is only dirty because of hint-type changes, we don't actually write it out. That does avoid using the same nonce for multiple writes, because now there's only one write. It also fixes the problem on standbys that Andres was complaining about, because on a standby, the only way a page can possibly be dirtied without an associated WAL record is through a hint-type change. However, I think we'd find that this, too, is pretty expensive in certain workloads. It's useful to write hint bits - that's why we do it. -- Robert Haas EDB: http://www.enterprisedb.com
Re: storing an explicit nonce
On Wed, May 26, 2021 at 04:26:01PM -0700, Andres Freund wrote: > Hi, > > On 2021-05-26 07:14:47 +0200, Antonin Houska wrote: > > Bruce Momjian wrote: > > > On Tue, May 25, 2021 at 04:48:21PM -0700, Andres Freund wrote: > > > My point was about whether we need to change the nonce, and hence > > > WAL-log full page images if we change hint bits. If we don't and > > > reencrypt the page with the same nonce, don't we only expose the hint > > > bits? I was not suggesting we avoid changing the nonce in non-hint-bit > > > cases. > > > > > > I don't understand your computation above. You decrypt the page into > > > shared buffers, you change a hint bit, and rewrite the page. You are > > > re-XOR'ing the buffer copy with the same key and nonce. Doesn't that > > > only change the hint bits in the new write? > > Yea, I had a bit of a misfire there. Sorry. > > I suspect that if we try to not disclose data if an attacker has write > access, this still leaves us with issues around nonce reuse, unless we > also employ integrity measures. Particularly due to CTR mode, which > makes it easy to manipulate individual parts of the encrypted page > without causing the decrypted page to be invalid. E.g. the attacker can > just update pd_upper on the page by a small offset, and suddenly the > replay will insert the tuple at a slightly shifted offset - which then > seems to leak enough data to actually analyze things? Yes, I don't think protecting from write access is a realistic goal at this point, and frankly ever. I think write access protection needs all-cluster-file encryption. This is documented: https://github.com/postgres/postgres/compare/master..bmomjian:_cfe-01-doc.patch Cluster file encryption does not protect against unauthorized file system writes. Such writes can allow data decryption if used to weaken the system's security and the weakened system is later supplied with the externally-stored cluster encryption key. 
This also does not always detect if users with write access remove or modify database files. If this needs more text, let me know.

> As the patch stands that seems trivially doable, because as I read it, most of the page header is not encrypted, and checksums are done of the already encrypted data. But even if that weren't the case, brute forcing 16bit worth of checksum isn't too bad, even though it would obviously make an attack a lot more noisy.
>
> https://github.com/bmomjian/postgres/commit/7b43d37a5edb91c29ab6b4bb00def05def502c33#diff-0dcb5b2f36c573e2a7787994690b8fe585001591105f78e58ae3accec8f998e0R92
> /*
>  * Check if the page has a special size == GISTPageOpaqueData, a valid
>  * GIST_PAGE_ID, no invalid GiST flag bits are set, and a valid LSN. This
>  * is true for all GiST pages, and perhaps a few pages that are not. The
>  * only downside of guessing wrong is that we might not update the LSN for
>  * some non-permanent relation page changes, and therefore reuse the IV,
>  * which seems acceptable.
>  */
>
> Huh?

Are you asking about this C comment in relation to the discussion above, or is it an independent question? Are you asking what it means?

-- Bruce Momjian https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Re: CREATE COLLATION - check for duplicate options and error out if found one
On Wed, May 26, 2021 at 7:44 PM Bharath Rupireddy wrote: > > On Wed, May 26, 2021 at 7:18 PM vignesh C wrote: > > +1 for fixing this issue, we have handled this error in other places. > > The patch does not apply on head, could you rebase the patch on head > > and post a new patch. > > Thanks. I thought of rebasing once the other patch (which reorganizes > "...specified more than once" error) gets committed. Anyways, I've > rebased for now on the latest master. Please review v2 patch. The test changes look good to me, I liked the idea of rebasing the source changes once "specified more than once" patch gets committed. I will review the code changes after that. Regards, Vignesh
Re: storing an explicit nonce
On Wed, May 26, 2021 at 04:46:29PM -0700, Andres Freund wrote: > Hi, > > On 2021-05-25 22:23:46 -0400, Stephen Frost wrote: > > Andres mentioned other possible cases where the LSN doesn’t change even > > though we change the page and, as he’s probably right, we would have to > > figure out a solution in those cases too (potentially including cases like > > crash recovery or replay on a replica where we can’t really just go around > > creating dummy WAL records to get new LSNs..). > > Yea, I think there's quite a few of those. For one, we don't guarantee > that the hole between pd_lower/upper is zeroes. It e.g. contains > old tuple data after deleted tuples are pruned away. But when logging an > FPI, we omit that range. Which means that after crash recovery the area > is zeroed out. There's several cases where padding can result in the > same. > > Just look at checkXLogConsistency(), heap_mask() et al for all the > differences that can occur and that need to be ignored for the recovery > consistency checking to work. > > Particularly the hole issue seems trivial to exploit, because we know > the plaintext of the hole after crash recovery (0s). > > > I don't see how using the LSN alone is salvageable. OK, so you are saying the replica would have all zeros because of crash recovery, so XOR'ing that with the encryption stream makes the encryption stream visible, and you could use that to decrypt the dead data on the primary. That is an interesting case that we would need to fix. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
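The known-plaintext leak Bruce summarizes here can be sketched with a toy XOR stream cipher. All names and values below are invented, and a hashlib-based keystream stands in for AES-CTR; only the XOR property under discussion is illustrated: once the attacker knows the page hole decrypts to zeros on the crash-recovered replica, XOR'ing the replica's ciphertext with those zeros yields the keystream, which then decrypts the primary's copy of the same hole encrypted under the same nonce.

```python
import hashlib

def keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    # Toy CTR-style keystream: hash(key || nonce || counter) blocks stand in
    # for AES-CTR output; the XOR property demonstrated is the same.
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:length]

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

key = b"k" * 32
nonce = b"same-lsn-nonce"                # LSN-derived nonce, reused on both nodes

secret_hole = b"dead tuple data!"        # pruned tuples still present on the primary
primary_ct = xor(secret_hole, keystream(key, nonce, len(secret_hole)))

zero_hole = b"\x00" * len(secret_hole)   # hole is zeroed after crash recovery
replica_ct = xor(zero_hole, keystream(key, nonce, len(zero_hole)))

# Known zeros reveal the keystream (XOR with 0 is the identity)...
leaked_keystream = xor(replica_ct, zero_hole)
# ...which decrypts the primary's hole, encrypted under the same nonce.
recovered = xor(primary_ct, leaked_keystream)
assert recovered == secret_hole
```

Nothing here depends on the toy keystream being weak; any stream cipher has this property when a (key, nonce) pair is reused across two different plaintexts.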
Re: storing an explicit nonce
On Wed, May 26, 2021 at 05:11:24PM -0700, Andres Freund wrote: > Hi, > > On 2021-05-25 17:12:05 -0400, Bruce Momjian wrote: > > If we used a block cipher instead of a streaming one (CTR), this might > > not work because the earlier blocks can be based in the output of > > later blocks. > > What made us choose CTR for WAL & data file encryption? I checked the > README in the patchset and the wiki page, and neither seem to discuss > that. > > The dangers around nonce reuse, the space overhead of storing the nonce, > the fact that single bit changes in the encrypted data don't propagate > seem not great? Why aren't we using something like XTS? It has obvious > issues as wel, but CTR's weaknesses seem at least as great. And if we > want a MAC, then we don't want CTR either. We chose CTR because it was fast, and we could use the same method for WAL, which needs a streaming, not block, cipher. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Re: storing an explicit nonce
On Thu, May 27, 2021 at 05:45:21PM +0800, Neil Chen wrote: > Greetings, > > On Thu, May 27, 2021 at 4:52 PM Bruce Momjian wrote: > > > > > I am confused why checksums, which are widely used, acceptably require > > wal_log_hints, but there is concern that file encryption, which is > > heavier, cannot acceptably require wal_log_hints. I must be missing > > something. > > > > Why can't checksums also throw away hint bit changes like you want to do > > for file encryption and not require wal_log_hints? > > > > I'm really confused about it, too. I read the above communication, not sure if > my understanding is correct... What we are facing is not only the change of > flag such as *pd_flags*, but also others like pointer array changes in btree > like Robert said. We don't need them to write a WAL record. Well, the code now does write full page images for hint bit changes, so it should work fine. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Re: storing an explicit nonce
On Thu, May 27, 2021 at 10:47:13AM -0400, Robert Haas wrote: > On Wed, May 26, 2021 at 4:40 PM Bruce Momjian wrote: > > You are saying that by using a non-LSN nonce, you can write out the page > > with a new nonce, but the same LSN, and also discard the page during > > crash recovery and use the WAL copy? > > I don't know what "discard the page during crash recovery and use the > WAL copy" means. I was asking how decoupling the nonce from the LSN allows for us to avoid full page writes for hint bit changes. I am guessing you are saying that on recovery, if we see a hint-bit-only change in the WAL (with a new nonce), we just throw away the page because it could be torn and use the WAL full page write version. > > I am confused why checksums, which are widely used, acceptably require > > wal_log_hints, but there is concern that file encryption, which is > > heavier, cannot acceptably require wal_log_hints. I must be missing > > something. > > I explained this in the first complete paragraph of my first email > with this subject line: "For example, right now, we only need to WAL > log hints for the first write to each page after a checkpoint, but in > this approach, if the same page is written multiple times per > checkpoint cycle, we'd need to log hints every time." That's a huge > difference. Page eviction in some workloads can push the same pages > out of shared buffers every few seconds, whereas something that has to > be done once per checkpoint cycle cannot affect each page nearly so > often. A checkpoint is only going to occur every 5 minutes by default, > or more realistically every 10-15 minutes in a well-tuned production > system. In other words, we're not holding up some kind of double > standard, where the existing feature is allowed to depend on doing a > certain thing but your feature isn't allowed to depend on the same > thing. Your design depends on doing something which is potentially > 100x+ more expensive than the existing thing. 
It's not always going to > be that expensive, but it can be. Yes, it might be 1e100+++ more expensive too, but we don't know, and I am not ready to add a lot of complexity for such an unknown. > > Why can't checksums also throw away hint bit changes like you want to do > > for file encryption and not require wal_log_hints? > > Well, I don't want to throw away hint bit changes, just like we don't > throw them away right now. And I want to do that by making sure that > each time the page is written, we use a different nonce, but without > the expense of having to advance the LSN. > > Now, another option is to do what you suggest here. We could say that > if a dirty page is evicted, but the page is only dirty because of > hint-type changes, we don't actually write it out. That does avoid > using the same nonce for multiple writes, because now there's only one > write. It also fixes the problem on standbys that Andres was > complaining about, because on a standby, the only way a page can > possibly be dirtied without an associated WAL record is through a > hint-type change. However, I think we'd find that this, too, is pretty > expensive in certain workloads. It's useful to write hint bits - > that's why we do it. Oh, that does sound nice. It is kind of an exit hatch if we are evicting pages often for hint bit changes. I like it. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Re: Logical Replication - improve error message while adding tables to the publication in check_publication_add_relation
On Wed, May 26, 2021 at 7:55 PM Bharath Rupireddy wrote: > > On Wed, May 26, 2021 at 7:38 PM vignesh C wrote: > > > Attaching v5 patch, please have a look. > > > > We get the following error while adding an index: > > create publication mypub for table idx_t1; > > ERROR: "idx_t1" is an index > > > > This error occurs internally from table_openrv function call, if we > > could replace this with relation_openrv and then check the table kind, > > we could throw a similar error message here too like the other changes > > in the patch. > > Do you say that we replace table_open in publication_add_relation with > relation_open and have the "\"%s\" is an index" or "\"%s\" is a > composite type" checks in check_publication_add_relation? If that is > so, I don't think it's a good idea to have the extra code in > check_publication_add_relation and I would like it to be the way it is > currently. Before calling check_publication_add_relation, we will call OpenTableList to get the list of relations. In OpenTableList we don't include the errdetail for the failure like you have fixed it in check_publication_add_relation. When a user tries to add index objects or composite types, the error will be thrown earlier. I didn't mean to change check_publication_add_relation, I meant to change table_openrv to relation_openrv in OpenTableList and include error details in case of failure like the change attached. If you are OK, please include the change in your patch.
Regards, Vignesh

diff --git a/src/backend/commands/publicationcmds.c b/src/backend/commands/publicationcmds.c
index 95c253c8e0..b6380d7491 100644
--- a/src/backend/commands/publicationcmds.c
+++ b/src/backend/commands/publicationcmds.c
@@ -16,6 +16,7 @@
 #include "access/genam.h"
 #include "access/htup_details.h"
+#include "access/relation.h"
 #include "access/table.h"
 #include "access/xact.h"
 #include "catalog/catalog.h"
@@ -523,7 +524,21 @@ OpenTableList(List *tables)
 		/* Allow query cancel in case this takes a long time */
 		CHECK_FOR_INTERRUPTS();
 
-		rel = table_openrv(rv, ShareUpdateExclusiveLock);
+		rel = relation_openrv(rv, ShareUpdateExclusiveLock);
+		if (rel->rd_rel->relkind == RELKIND_INDEX ||
+			rel->rd_rel->relkind == RELKIND_PARTITIONED_INDEX)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("\"%s\" is an index",
+							RelationGetRelationName(rel)),
+					 errdetail("Index objects not supported by publications.")));
+		else if (rel->rd_rel->relkind == RELKIND_COMPOSITE_TYPE)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("\"%s\" is a composite type",
+							RelationGetRelationName(rel)),
+					 errdetail("Composite type not supported by publications.")));
+
 		myrelid = RelationGetRelid(rel);

 		/*
Re: storing an explicit nonce
Hi, On Thu, May 27, 2021, at 08:10, Bruce Momjian wrote: > On Wed, May 26, 2021 at 05:11:24PM -0700, Andres Freund wrote: > > Hi, > > > > On 2021-05-25 17:12:05 -0400, Bruce Momjian wrote: > > > If we used a block cipher instead of a streaming one (CTR), this might > > > not work because the earlier blocks can be based in the output of > > > later blocks. > > > > What made us choose CTR for WAL & data file encryption? I checked the > > README in the patchset and the wiki page, and neither seem to discuss > > that. > > > > The dangers around nonce reuse, the space overhead of storing the nonce, > > the fact that single bit changes in the encrypted data don't propagate > > seem not great? Why aren't we using something like XTS? It has obvious > > issues as wel, but CTR's weaknesses seem at least as great. And if we > > want a MAC, then we don't want CTR either. > > We chose CTR because it was fast, and we could use the same method for > WAL, which needs a streaming, not block, cipher. The WAL is block oriented too. Andres
Re: storing an explicit nonce
Greetings, * Andres Freund (and...@anarazel.de) wrote: > On Thu, May 27, 2021, at 08:10, Bruce Momjian wrote: > > On Wed, May 26, 2021 at 05:11:24PM -0700, Andres Freund wrote: > > > On 2021-05-25 17:12:05 -0400, Bruce Momjian wrote: > > > > If we used a block cipher instead of a streaming one (CTR), this might > > > > not work because the earlier blocks can be based in the output of > > > > later blocks. > > > > > > What made us choose CTR for WAL & data file encryption? I checked the > > > README in the patchset and the wiki page, and neither seem to discuss > > > that. > > > > > > The dangers around nonce reuse, the space overhead of storing the nonce, > > > the fact that single bit changes in the encrypted data don't propagate > > > seem not great? Why aren't we using something like XTS? It has obvious > > > issues as wel, but CTR's weaknesses seem at least as great. And if we > > > want a MAC, then we don't want CTR either. > > > > We chose CTR because it was fast, and we could use the same method for > > WAL, which needs a streaming, not block, cipher. > > The WAL is block oriented too. I'm curious what you'd suggest for the heap where we wouldn't be able to have block chaining (at least, I presume we aren't talking about rewriting entire segments whenever we change something in a heap). Thanks, Stephen signature.asc Description: PGP signature
Re: storing an explicit nonce
Hi, On 2021-05-27 11:49:33 -0400, Stephen Frost wrote: > * Andres Freund (and...@anarazel.de) wrote: > > On Thu, May 27, 2021, at 08:10, Bruce Momjian wrote: > > > On Wed, May 26, 2021 at 05:11:24PM -0700, Andres Freund wrote: > > > > On 2021-05-25 17:12:05 -0400, Bruce Momjian wrote: > > > > > If we used a block cipher instead of a streaming one (CTR), this might > > > > > not work because the earlier blocks can be based in the output of > > > > > later blocks. > > > > > > > > What made us choose CTR for WAL & data file encryption? I checked the > > > > README in the patchset and the wiki page, and neither seem to discuss > > > > that. > > > > > > > > The dangers around nonce reuse, the space overhead of storing the nonce, > > > > the fact that single bit changes in the encrypted data don't propagate > > > > seem not great? Why aren't we using something like XTS? It has obvious > > > > issues as wel, but CTR's weaknesses seem at least as great. And if we > > > > want a MAC, then we don't want CTR either. > > > > > > We chose CTR because it was fast, and we could use the same method for > > > WAL, which needs a streaming, not block, cipher. > > > > The WAL is block oriented too. > > I'm curious what you'd suggest for the heap where we wouldn't be able to > have block chaining (at least, I presume we aren't talking about > rewriting entire segments whenever we change something in a heap). What prevents us from using something like XTS? I'm not saying that that is the right approach, due to the fact that it leaks information about a block being the same as an earlier version of the same block. But right now we are talking about using CTR without addressing the weaknesses CTR has, where a failure to increase the nonce is fatal (the code even documents known cases where that could happen!), and where there's no error propagation within a block. Greetings, Andres Freund
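The CTR malleability Andres keeps pointing at (a small ciphertext edit shifting pd_upper without invalidating the rest of the page) follows from the fact that flipping ciphertext bits flips exactly the same plaintext bits, with no error propagation. A minimal sketch, with an invented keystream standing in for AES-CTR output that the attacker never sees:

```python
def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

# A page-header-like plaintext: pretend bytes 0-1 hold pd_upper = 0x0150.
plaintext = (0x0150).to_bytes(2, "little") + b"rest of the page"
ks = bytes(range(len(plaintext)))       # stand-in keystream, unknown to the attacker
ciphertext = xor(plaintext, ks)

# Attacker XORs the first two ciphertext bytes with a chosen delta,
# never touching (or needing) the key or keystream:
delta = (0x0150 ^ 0x0140).to_bytes(2, "little")
tampered = xor(ciphertext[:2], delta) + ciphertext[2:]

decrypted = xor(tampered, ks)
assert int.from_bytes(decrypted[:2], "little") == 0x0140  # pd_upper shifted
assert decrypted[2:] == b"rest of the page"               # nothing else disturbed
```

A wide-block or chained mode would instead turn the same edit into a garbled block, making the tamper far more likely to be caught as corruption.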
Re: storing an explicit nonce
On Thu, May 27, 2021 at 11:19 AM Bruce Momjian wrote: > On Thu, May 27, 2021 at 10:47:13AM -0400, Robert Haas wrote: > > On Wed, May 26, 2021 at 4:40 PM Bruce Momjian wrote: > > > You are saying that by using a non-LSN nonce, you can write out the page > > > with a new nonce, but the same LSN, and also discard the page during > > > crash recovery and use the WAL copy? > > > > I don't know what "discard the page during crash recovery and use the > > WAL copy" means. > > I was asking how decoupling the nonce from the LSN allows for us to > avoid full page writes for hint bit changes. I am guessing you are > saying that on recovery, if we see a hint-bit-only change in the WAL > (with a new nonce), we just throw away the page because it could be torn > and use the WAL full page write version. Well, in the design where the nonce is stored in the page, there is no need for every hint-type change to appear in the WAL at all. Once per checkpoint cycle, you need to write a full page image, as we do for checksums or wal_log_hints. The rest of the time, you can just bump the nonce and rewrite the page, same as we do today. > Yes, it might be 1e100+++ more expensive too, but we don't know, and I > am not ready to add a lot of complexity for such an unknown. No, it can't be 1e100+++ more expensive, because it's not realistically possible for a page to be written to disk 1e100+++ times per checkpoint cycle. It is however entirely possible for it to be written 100 times per checkpoint cycle. That is not something unknown about which we need to speculate; it is easy to see that this can happen, even on a simple test like pgbench with a data set larger than shared buffers. It is not right to confuse "we have no idea whether this will be expensive" with "how expensive this will be is workload-dependent," which is what you seem to be doing here. 
If we had no idea whether something would be expensive, then I agree that it might not be worth adding complexity for it, or maybe some testing should be done first to find out. But if we know for certain that in some workloads something can be very expensive, then we had better at least talk about whether it is worth adding complexity in order to resolve the problem. And that is the situation here. I am not even convinced that storing the nonce in the block is going to be more complex, because it seems to me that the patches I posted upthread worked out pretty cleanly. There are some things to discuss and think about there, for sure, but it is not like we are talking about inventing warp drive. -- Robert Haas EDB: http://www.enterprisedb.com
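The nonce-in-page design Robert describes can be sketched as a toy model (all names are invented; this is not the patch's actual code): every physical write bumps a per-page counter used as the IV, while the LSN moves only when a WAL record is actually emitted.

```python
# Toy model of "store the nonce in the page": hint-type changes dirty the
# page without a WAL record, so the LSN stays put, but each write to disk
# still gets a fresh IV by bumping a per-page counter. Names are invented.
class Page:
    def __init__(self):
        self.lsn = 0x1000
        self.nonce = 0
        self.hint_bits = 0

def write_page(page: Page) -> tuple:
    page.nonce += 1          # fresh IV for every physical write
    return (page.lsn, page.nonce)

p = Page()
p.hint_bits |= 1             # hint-type change: no WAL record, LSN unchanged
first = write_page(p)
p.hint_bits |= 2             # another hint-type change before the next eviction
second = write_page(p)

assert first[0] == second[0]     # same LSN on both writes...
assert first[1] != second[1]     # ...but a different nonce each time
```

With the LSN itself as the nonce, the only way to get `first[1] != second[1]` would be to advance the LSN, i.e. emit WAL for every such write, which is the cost being argued about.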
Re: storing an explicit nonce
Hi, On 2021-05-27 12:01:16 -0400, Bruce Momjian wrote: > On Thu, May 27, 2021 at 08:34:51AM -0700, Andres Freund wrote: > > On Thu, May 27, 2021, at 08:10, Bruce Momjian wrote: > > > On Wed, May 26, 2021 at 05:11:24PM -0700, Andres Freund wrote: > > > > On 2021-05-25 17:12:05 -0400, Bruce Momjian wrote: > > > > > If we used a block cipher instead of a streaming one (CTR), this might > > > > > not work because the earlier blocks can be based in the output of > > > > > later blocks. > > > > > > > > What made us choose CTR for WAL & data file encryption? I checked the > > > > README in the patchset and the wiki page, and neither seem to discuss > > > > that. > > > > > > > > The dangers around nonce reuse, the space overhead of storing the nonce, > > > > the fact that single bit changes in the encrypted data don't propagate > > > > seem not great? Why aren't we using something like XTS? It has obvious > > > > issues as wel, but CTR's weaknesses seem at least as great. And if we > > > > want a MAC, then we don't want CTR either. > > > > > > We chose CTR because it was fast, and we could use the same method for > > > WAL, which needs a streaming, not block, cipher. > > > > The WAL is block oriented too. > > Well, AES block mode only does 16-byte blocks, as far as I know, and I > assume WAL is more granular than that. WAL is 8kB blocks by default. We only ever write it out with at least that granularity. > Also, you need to know the bytes _before_ the WAL do write a new > 16-byte block, so it seems overly complex for our usage too. See the XTS reference. Yes, it needs the previous 16bytes, but only within the 8kB page. Greetings, Andres Freund
Re: storing an explicit nonce
On Thu, May 27, 2021 at 12:03:00PM -0400, Robert Haas wrote: > On Thu, May 27, 2021 at 11:19 AM Bruce Momjian wrote: > > I was asking how decoupling the nonce from the LSN allows for us to > > avoid full page writes for hint bit changes. I am guessing you are > > saying that on recovery, if we see a hint-bit-only change in the WAL > > (with a new nonce), we just throw away the page because it could be torn > > and use the WAL full page write version. > > Well, in the design where the nonce is stored in the page, there is no > need for every hint-type change to appear in the WAL at all. Once per > checkpoint cycle, you need to write a full page image, as we do for > checksums or wal_log_hints. The rest of the time, you can just bump > the nonce and rewrite the page, same as we do today. What is it about having the nonce be the LSN that doesn't allow that to happen? Could we just create a dummy LSN record and assign that to the page and use that as a nonce? > > Yes, it might be 1e100+++ more expensive too, but we don't know, and I > am not ready to add a lot of complexity for such an unknown. > > No, it can't be 1e100+++ more expensive, because it's not > realistically possible for a page to be written to disk 1e100+++ times > per checkpoint cycle. It is however entirely possible for it to be > written 100 times per checkpoint cycle. That is not something unknown > about which we need to speculate; it is easy to see that this can > happen, even on a simple test like pgbench with a data set larger than > shared buffers. I guess you didn't get my joke on that one. ;-) > It is not right to confuse "we have no idea whether this will be > expensive" with "how expensive this will be is workload-dependent," > which is what you seem to be doing here. If we had no idea whether > something would be expensive, then I agree that it might not be worth > adding complexity for it, or maybe some testing should be done first > to find out.
But if we know for certain that in some workloads > something can be very expensive, then we had better at least talk > about whether it is worth adding complexity in order to resolve the > problem. And that is the situation here. Sure, but the downsides of avoiding it seem very high to me, not only in code complexity but in requiring dump/reload or logical replication to deploy. > I am not even convinced that storing the nonce in the block is going > to be more complex, because it seems to me that the patches I posted > upthread worked out pretty cleanly. There are some things to discuss > and think about there, for sure, but it is not like we are talking > about inventing warp drive. See above. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Re: storing an explicit nonce
Hi, On 2021-05-27 11:10:00 -0400, Bruce Momjian wrote: > On Wed, May 26, 2021 at 04:46:29PM -0700, Andres Freund wrote: > > On 2021-05-25 22:23:46 -0400, Stephen Frost wrote: > > > Andres mentioned other possible cases where the LSN doesn’t change even > > > though we change the page and, as he’s probably right, we would have to > > > figure out a solution in those cases too (potentially including cases like > > > crash recovery or replay on a replica where we can’t really just go around > > > creating dummy WAL records to get new LSNs..). > > > > Yea, I think there's quite a few of those. For one, we don't guarantee > > that that the hole between pd_lower/upper is zeroes. It e.g. contains > > old tuple data after deleted tuples are pruned away. But when logging an > > FPI, we omit that range. Which means that after crash recovery the area > > is zeroed out. There's several cases where padding can result in the > > same. > > > > Just look at checkXLogConsistency(), heap_mask() et al for all the > > differences that can occur and that need to be ignored for the recovery > > consistency checking to work. > > > > Particularly the hole issue seems trivial to exploit, because we know > > the plaintext of the hole after crash recovery (0s). > > > > > > I don't see how using the LSN alone is salvagable. > > OK, so you are saying the replica would have all zeros because of crash > recovery, so XOR'ing that with the encryption steam makes the encryption > stream visible, and you could use that to decrypt the dead data on the > primary. That is an interesting case that would need to fix. I don't see how it's a viable security model to assume that you can ensure that we never write different data with the same LSN. Yes, you can fix a few cases, but how can we be confident that we're actually doing a good job, when the consequences are pretty dramatic. 
Nor do I think it's architecturally OK to impose a significant new hurdle against doing any sort of "changing" writes on standbys. It's time to move on from the idea of using the LSN as the nonce. Greetings, Andres Freund
Re: storing an explicit nonce
On Wed, May 26, 2021 at 05:02:01PM -0400, Bruce Momjian wrote: > Rather than surprise anyone, I might as well just come out and say some > things. First, I have always admitted this feature has limited > usefulness. > > I think a non-LSN nonce adds a lot of code complexity, which adds a code > and maintenance burden. It also prevents the creation of an encrypted > replica from a non-encrypted primary using binary replication, which > makes deployment harder. > > Take a feature of limited usefulness, add code complexity and deployment > difficulty, and the feature becomes even less useful. > > For these reasons, if we decide to go in the direction of using a > non-LSN nonce, I no longer plan to continue working on this feature. I > would rather work on things that have a more positive impact. Maybe a > non-LSN nonce is a better long-term plan, but there are too many > unknowns and complexity for me to feel comfortable with it. I had some more time to think about this. The big struggle for this feature has not been writing it, but rather keeping it lean enough that its code complexity will be acceptable for a feature of limited usefulness. (The Windows port and pg_upgrade took similar approaches.) Thinking about the feature to add checksums online, it seems to have failed due to us overcomplicating the feature. If we had not accepted the checksum restart requirement, the patch would probably be part of Postgres today. However, a few people asked for restartability, and since we don't really have much infrastructure to do online whole-cluster changes, it added a lot of code. Once the patch was done, we looked at the code size and the benefits of the feature, and decided it wasn't worth it. I suspect that if we start adding a non-LSN nonce and malicious write detection, we will end up with the same problem --- a complex patch for a feature that has limited usefulness, and requires dump/restore or logical replication to add it to a cluster.
I think such a patch would be rejected, and I would probably even vote against it myself. I don't want this to sound like I only want to do this my way, but I also don't want to be silent when I smell failure, and if the probability of failure gets too high, I am willing to abandon a feature rather than continue. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Re: storing an explicit nonce
On Thu, May 27, 2021 at 08:34:51AM -0700, Andres Freund wrote: > Hi, > > On Thu, May 27, 2021, at 08:10, Bruce Momjian wrote: > > On Wed, May 26, 2021 at 05:11:24PM -0700, Andres Freund wrote: > > > Hi, > > > > > > On 2021-05-25 17:12:05 -0400, Bruce Momjian wrote: > > > > If we used a block cipher instead of a streaming one (CTR), this might > > > > not work because the earlier blocks can be based in the output of > > > > later blocks. > > > > > > What made us choose CTR for WAL & data file encryption? I checked the > > > README in the patchset and the wiki page, and neither seem to discuss > > > that. > > > > > > The dangers around nonce reuse, the space overhead of storing the nonce, > > > the fact that single bit changes in the encrypted data don't propagate > > > seem not great? Why aren't we using something like XTS? It has obvious > > > issues as wel, but CTR's weaknesses seem at least as great. And if we > > > want a MAC, then we don't want CTR either. > > > > We chose CTR because it was fast, and we could use the same method for > > WAL, which needs a streaming, not block, cipher. > > The WAL is block oriented too. Well, AES block mode only does 16-byte blocks, as far as I know, and I assume WAL is more granular than that. Also, you need to know the bytes _before_ the WAL to write a new 16-byte block, so it seems overly complex for our usage too. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Re: storing an explicit nonce
Hi, On 2021-05-27 10:57:24 -0400, Bruce Momjian wrote: > On Wed, May 26, 2021 at 04:26:01PM -0700, Andres Freund wrote: > > I suspect that if we try to not disclose data if an attacker has write > > access, this still leaves us with issues around nonce reuse, unless we > > also employ integrity measures. Particularly due to CTR mode, which > > makes it easy to manipulate individual parts of the encrypted page > > without causing the decrypted page to be invalid. E.g. the attacker can > > just update pd_upper on the page by a small offset, and suddenly the > > replay will insert the tuple at a slightly shifted offset - which then > > seems to leak enough data to actually analyze things? > > Yes, I don't think protecting from write access is a realistic goal at > this point, and frankly ever. I think write access protection needs > all-cluster-file encryption. This is documented: > > > https://github.com/postgres/postgres/compare/master..bmomjian:_cfe-01-doc.patch > > Cluster file encryption does not protect against unauthorized > file system writes. Such writes can allow data decryption if > used to weaken the system's security and the weakened system is > later supplied with the externally-stored cluster encryption key. > This also does not always detect if users with write access remove > or modify database files. > > If this needs more text, let me know. Well, it's one thing to say that it's not a complete protection, and another that a few byte sized writes to a single page are sufficient to get access to encrypted data. And "all-cluster-file" encryption won't help against the type of scenario I outlined. > > https://github.com/bmomjian/postgres/commit/7b43d37a5edb91c29ab6b4bb00def05def502c33#diff-0dcb5b2f36c573e2a7787994690b8fe585001591105f78e58ae3accec8f998e0R92 > > /* > > * Check if the page has a special size == GISTPageOpaqueData, a valid > > * GIST_PAGE_ID, no invalid GiST flag bits are set, and a valid LSN. 
> > This > > * is true for all GiST pages, and perhaps a few pages that are not. > > The > > * only downside of guessing wrong is that we might not update the LSN > > for > > * some non-permanent relation page changes, and therefore reuse the IV, > > * which seems acceptable. > > */ > > > > Huh? > > Are you asking about this C comment in relation to the discussion > above, or is it an independent question? Are you asking what it means? The comment is blithely waving away a fundamental no-no (reusing nonces) when using CTR mode as "acceptable". Greetings, Andres Freund
Re: storing an explicit nonce
On Thu, May 27, 2021 at 12:01 PM Andres Freund wrote: > What prevents us from using something like XTS? I'm not saying that that > is the right approach, due to the fact that it leaks information about a > block being the same as an earlier version of the same block. But right > now we are talking about using CTR without addressing the weaknesses CTR > has, where a failure to increase the nonce is fatal (the code even > documents known cases where that could happen!), and where there's no > error propagation within a block. I spent some time this morning reading up on XTS in general and also on previous discussions on this list. It seems like XTS is considered state-of-the-art for full disk encryption, and what we're doing seems to me to be similar in concept. The most useful on-list discussion that I found was on this thread: https://www.postgresql.org/message-id/flat/c878de71-a0c3-96b2-3e11-9ac2c35357c3%40joeconway.com#19d3b7c37b9f84798f899360393584df There are a lot of things that people said on that thread, but then Bruce basically proposes CBC and/or CTR and I couldn't clearly understand the reasons for that choice. Maybe there was some off-list discussion of this that wasn't captured in the email traffic? All that having been said, I am pretty sure I don't fully understand what any of these modes involve. I gather that XTS requires two keys, but it seems like it doesn't require a nonce. It seems to use a "tweak" that is generated from the block number and the position within the block (since an e.g. 8kB database block is being encrypted as a bunch of 16-byte AES blocks) but apparently there's no problem with the tweak being the same every time the block is encrypted? If no nonce is required, that seems like a massive advantage, since then we don't need to worry about how to get one or about how to make sure it's never reused. -- Robert Haas EDB: http://www.enterprisedb.com
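[Editor's note: the tweak scheme discussed above can be sketched roughly as follows. This is a toy model, not real cryptography: a hash-based Feistel permutation stands in for the AES block cipher, and the per-block tweak is derived by a keyed hash of the page number and the 16-byte block index, where real XTS instead encrypts the sector number once and multiplies by alpha^j in GF(2^128). Only the structure is illustrated.]

```python
import hashlib

def prf(key: bytes, data: bytes) -> bytes:
    # Keyed pseudorandom function used for both Feistel rounds and tweaks.
    return hashlib.sha256(key + data).digest()

def toy_block_encrypt(key: bytes, block: bytes) -> bytes:
    # 4-round Feistel network over a 16-byte block: a stand-in for one
    # AES block-cipher invocation -- NOT secure, illustration only.
    l, r = block[:8], block[8:]
    for rnd in range(4):
        l, r = r, bytes(a ^ b for a, b in zip(l, prf(key, bytes([rnd]) + r)))
    return l + r

def xor16(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def xts_like_encrypt(key1: bytes, key2: bytes, page_no: int, page: bytes) -> bytes:
    # XTS structure: C_j = E_k1(P_j XOR T_j) XOR T_j, where the tweak T_j
    # depends only on the page (sector) number and block index j -- no nonce.
    out = []
    for j in range(len(page) // 16):
        tweak = prf(key2, page_no.to_bytes(8, "big") + j.to_bytes(8, "big"))[:16]
        p = page[16 * j:16 * (j + 1)]
        out.append(xor16(toy_block_encrypt(key1, xor16(p, tweak)), tweak))
    return b"".join(out)
```

Because the tweak is deterministic, encrypting the same page contents twice yields the same ciphertext (the "identical block version" leak Andres mentions below), while identical 16-byte plaintext blocks at different positions still encrypt differently because their tweaks differ.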
Re: storing an explicit nonce
Hi, On 2021-05-27 10:47:13 -0400, Robert Haas wrote: > Now, another option is to do what you suggest here. We could say that > if a dirty page is evicted, but the page is only dirty because of > hint-type changes, we don't actually write it out. That does avoid > using the same nonce for multiple writes, because now there's only one > write. It also fixes the problem on standbys that Andres was > complaining about, because on a standby, the only way a page can > possibly be dirtied without an associated WAL record is through a > hint-type change. What does that protect against that I was concerned about? That still allows hint bits to be leaked, via 1) replay WAL record with FPI 2) hint bit change during read 3) incremental page change vs 1) 3). Even if we declare that OK, it doesn't actually address the whole issue of WAL replay not necessarily re-creating bit identical page contents. Greetings, Andres Freund
Re: storing an explicit nonce
Hi, On 2021-05-27 12:28:39 -0400, Robert Haas wrote: > All that having been said, I am pretty sure I don't fully understand > what any of these modes involve. I gather that XTS requires two keys, > but it seems like it doesn't require a nonce. It needs a second secret, but that second secret can - as far as I understand it - be generated using a strong prng and encrypted with the "main" key, and stored in a central location. > It seems to use a "tweak" that is generated from the block number and > the position within the block (since an e.g. 8kB database block is > being encrypted as a bunch of 16-byte AES blocks) but apparently > there's no problem with the tweak being the same every time the block > is encrypted? Right. That comes with a price however: It leaks the information that a block "version" is identical to an earlier version of the block. That's obviously better than leaking information that allows decryption like with the nonce reuse issue. Nor does it provide integrity - which does seem like a significant issue going forward. Which does require storing additional per-page data... Greetings, Andres Freund
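[Editor's note: the second-secret handling Andres describes can be sketched as below. All names are hypothetical, and the XOR-with-keyed-hash "wrap" merely stands in for a real key-wrapping mode such as AES-KW, which an actual implementation would use.]

```python
import hashlib
import secrets

def wrap_key(main_key: bytes, secret: bytes) -> bytes:
    # Encrypt the generated secret under the main key so only the wrapped
    # form is stored centrally.  A keyed-hash XOR pad stands in for a real
    # key-wrap cipher here -- illustration only.
    pad = hashlib.sha256(main_key + b"xts-secondary-key-wrap").digest()
    return bytes(a ^ b for a, b in zip(secret, pad))

def unwrap_key(main_key: bytes, wrapped: bytes) -> bytes:
    # The XOR wrap is its own inverse.
    return wrap_key(main_key, wrapped)

main_key = secrets.token_bytes(32)
secondary = secrets.token_bytes(32)   # strong-PRNG second secret for XTS
stored = wrap_key(main_key, secondary)
```

The secondary key never needs to be unique per page or per write; it is generated once and recoverable only with the main key.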
Re: storing an explicit nonce
On Thu, May 27, 2021 at 12:28:39PM -0400, Robert Haas wrote: > On Thu, May 27, 2021 at 12:01 PM Andres Freund wrote: > > What prevents us from using something like XTS? I'm not saying that that > > is the right approach, due to the fact that it leaks information about a > > block being the same as an earlier version of the same block. But right > > now we are talking about using CTR without addressing the weaknesses CTR > > has, where a failure to increase the nonce is fatal (the code even > > documents known cases where that could happen!), and where there's no > > error propagation within a block. > > I spent some time this morning reading up on XTS in general and also > on previous discussions on this list. It seems like XTS is > considered state-of-the-art for full disk encryption, and what we're > doing seems to me to be similar in concept. The most useful on-list > discussion that I found was on this thread: > > https://www.postgresql.org/message-id/flat/c878de71-a0c3-96b2-3e11-9ac2c35357c3%40joeconway.com#19d3b7c37b9f84798f899360393584df > > There are a lot of things that people said on that thread, but then > Bruce basically proposes CBC and/or CTR and I couldn't clearly > understand the reasons for that choice. Maybe there was some off-list > discussion of this that wasn't captured in the email traffic? There was no other discussion about XTS that I know of. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Re: storing an explicit nonce
On Thu, May 27, 2021 at 12:31 PM Andres Freund wrote: > What does that protect against that I was concerned about? That still > allows hint bits to be leaked, via > > 1) replay WAL record with FPI > 2) hint bit change during read > 3) incremental page change > > vs 1) 3). Even if we declare that OK, it doesn't actually address the > whole issue of WAL replay not necessarily re-creating bit identical page > contents. You're right. That seems fatal, as it would lead to encrypting the different versions of the page with the same IV on the master and the standby, and the differences would consist of old data that could be recovered by XORing the two encrypted page versions. To be clear, it is tuple data that would be recovered, not just hint bits. -- Robert Haas EDB: http://www.enterprisedb.com
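[Editor's note: the XOR recovery described above is mechanical. A stdlib-only sketch, with a keyed hash standing in for the AES keystream generator; the algebra is identical for real AES-CTR, since ciphertext is always plaintext XOR keystream.]

```python
import hashlib

def ctr_keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    # CTR mode: keystream block i = E_key(nonce || i).  A keyed hash
    # stands in for AES here -- the reuse property is the same.
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:length]

def ctr_encrypt(key: bytes, nonce: bytes, plaintext: bytes) -> bytes:
    ks = ctr_keystream(key, nonce, len(plaintext))
    return bytes(p ^ k for p, k in zip(plaintext, ks))

# Two versions of the "same page" encrypted under a reused nonce:
key, nonce = b"0123456789abcdef", b"reused-nonce"
v1 = b"tuple data ... hint bit = 0".ljust(48, b"\x00")
v2 = b"tuple data ... hint bit = 1".ljust(48, b"\x00")
c1, c2 = ctr_encrypt(key, nonce, v1), ctr_encrypt(key, nonce, v2)

# The keystream cancels out: C1 XOR C2 == P1 XOR P2, exposing exactly
# which bytes changed between the two page versions.
leak = bytes(a ^ b for a, b in zip(c1, c2))
```

No key material is needed for this attack; write access plus two ciphertexts under one nonce is enough.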
Re: storing an explicit nonce
Greetings, * Andres Freund (and...@anarazel.de) wrote: > On 2021-05-27 12:28:39 -0400, Robert Haas wrote: > > All that having been said, I am pretty sure I don't fully understand > > what any of these modes involve. I gather that XTS requires two keys, > > but it seems like it doesn't require a nonce. > > It needs a second secret, but that second secret can - as far as I > understand it - be generated using a strong prng and encrypted with the > "main" key, and stored in a central location. Yes, I'm fairly confident this is the case. > > It seems to use a "tweak" that is generated from the block number and > > the position within the block (since an e.g. 8kB database block is > > being encrypted as a bunch of 16-byte AES blocks) but apparently > > there's no problem with the tweak being the same every time the block > > is encrypted? > > Right. That comes with a price however: It leaks the information that a > block "version" is identical to an earlier version of the block. That's > obviously better than leaking information that allows decryption like > with the nonce reuse issue. Right, if we simply can't solve the nonce-reuse concern then that would be better. > Nor does it provide integrity - which does seem like a significant issue > going forward. Which does require storing additional per-page data... Yeah, this is one of the reasons that I hadn't really been thrilled with XTS- I've really been looking down the road at eventually having GCM and having actual integrity validation included. That's not really a reason to rule it out though and Bruce's point about having a way to get to an encrypted cluster from an unencrypted one is certainly worth consideration. 
Naturally, we'd need to document everything appropriately but there isn't anything saying that we couldn't, say, have XTS in v15 without any adjustments to the page layout, accepting that there's no data integrity validation and focusing just on encryption, and then returning to the question about adding in data integrity validation for a future version, perhaps using the special area for a nonce+tag with GCM or maybe something else. Users who wish to move to a cluster with encryption and data integrity validation would have to get there through some other means than replication, but that's going to always be the case because we have to have space to store the tag, even if we can figure out some other solution for the nonce. Thanks, Stephen signature.asc Description: PGP signature
Re: Logical Replication - improve error message while adding tables to the publication in check_publication_add_relation
On Thu, May 27, 2021 at 9:02 PM vignesh C wrote: > > Do you say that we replace table_open in publication_add_relation with > > relation_open and have the "\"%s\" is an index" or "\"%s\" is a > > composite type" checks in check_publication_add_relation? If that is > > so, I don't think it's a good idea to have the extra code in > > check_publication_add_relation and I would like it to be the way it is > > currently. > > Before calling check_publication_add_relation, we will call > OpenTableList to get the list of relations. In OpenTableList we don't > include the errdetail for the failure like you have fixed it in > check_publication_add_relation. When a user tries to add index objects > or composite types, the error will be thrown earlier itself. I didn't > mean to change check_publication_add_relation, I meant to change > table_openrv to relation_openrv in OpenTableList and include error > details in case of failure like the change attached. If you are ok, > please include the change in your patch. I don't think we need to change that. The general intuition is that with CREATE PUBLICATION ... FOR TABLE/FOR ALL TABLES one can specify only tables, and if an index/composite type is specified, the error messages "\"%s\" is an index"/"\"%s\" is a composite type" can imply that they are not supported with CREATE PUBLICATION. There's no need for a detailed error message saying "Index/Composite Type cannot be added to publications.". Whereas foreign/unlogged/temporary/system tables are actually tables, and we don't support them. So a detailed error message here can state that explicitly. I'm not taking the patch; attaching v5 again here to make cfbot happy and for further review. BTW, when we use relation_openrv, we have to use relation_close. With Regards, Bharath Rupireddy. 
EnterpriseDB: http://www.enterprisedb.com

From c4c4efa96028fb9fe0b88e91c064a35484c6d8f0 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy
Date: Mon, 5 Apr 2021 19:00:24 +0530
Subject: [PATCH v5] Improve publication error messages

Adding a foreign table into a publication prints an error saying
"foo is not a table". Although a foreign table is not a regular table,
this message could possibly confuse users. Provide a suitable error
message according to the object class (table vs foreign table). While
at it, separate the unlogged/temp table error message into 2 messages.
---
 .../postgres_fdw/expected/postgres_fdw.out |  6 ++
 contrib/postgres_fdw/sql/postgres_fdw.sql  |  5 +
 src/backend/catalog/pg_publication.c       | 20 ---
 src/test/regress/expected/publication.out  | 16 +++
 src/test/regress/sql/publication.sql       | 14 +
 5 files changed, 58 insertions(+), 3 deletions(-)

diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 59e4e27ffb..b14fd8618c 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -9933,3 +9933,9 @@ DROP TABLE base_tbl4;
 DROP TABLE join_tbl;
 ALTER SERVER loopback OPTIONS (DROP async_capable);
 ALTER SERVER loopback2 OPTIONS (DROP async_capable);
+-- ===
+-- test error case for create publication on foreign table
+-- ===
+CREATE PUBLICATION testpub_foreign_tbl FOR TABLE ft1; --ERROR
+ERROR: "ft1" is a foreign table
+DETAIL: Foreign tables cannot be added to publications.
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index 107d1c0e03..561c9af23a 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -3138,3 +3138,8 @@
 DROP TABLE join_tbl;
 ALTER SERVER loopback OPTIONS (DROP async_capable);
 ALTER SERVER loopback2 OPTIONS (DROP async_capable);
+
+-- ===
+-- test error case for create publication on foreign table
+-- ===
+CREATE PUBLICATION testpub_foreign_tbl FOR TABLE ft1; --ERROR
diff --git a/src/backend/catalog/pg_publication.c b/src/backend/catalog/pg_publication.c
index 86e415af89..592d2d79e1 100644
--- a/src/backend/catalog/pg_publication.c
+++ b/src/backend/catalog/pg_publication.c
@@ -49,6 +49,14 @@
 static void
 check_publication_add_relation(Relation targetrel)
 {
+	/* FOREIGN table cannot be part of publication. */
+	if (RelationGetForm(targetrel)->relkind == RELKIND_FOREIGN_TABLE)
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+				 errmsg("\"%s\" is a foreign table",
+						RelationGetRelationName(targetrel)),
+				 errdetail("Foreign tables cannot be added to publications.")));
+
 	/* Must be a regular or partitioned table */
 	if (RelationGetForm(targetrel)->relkind != RELKIND_RELATION &&
 		RelationGetForm(targetrel)->relkind != RE
Re: Incorrect snapshots while promoting hot standby node when 2PC is used
Hi, On 2021-05-26 16:57:31 +0900, Michael Paquier wrote: > Yes, there should not be any as far as I recall. 2PC is kind of > special with its fake ProcArray entries. It's really quite an awful design :( > > I think to fix the issue we'd have to move > > ShutdownRecoveryTransactionEnvironment() to after > > XLogCtl->SharedRecoveryState > > = RECOVERY_STATE_DONE. > > > > The acquisition of ProcArrayLock() in > > ShutdownRecoveryTransactionEnvironment()->ExpireAllKnownAssignedTransactionIds() > > should prevent the data from being removed between the RecoveryInProgress() > > and the KnownAssignedXidsGetAndSetXmin() calls in GetSnapshotData(). > > > > I haven't yet figured out whether there would be a problem with deferring > > the > > other tasks in ShutdownRecoveryTransactionEnvironment() until after > > RECOVERY_STATE_DONE. > > Hmm. This would mean releasing all the exclusive locks tracked by a > standby, as of StandbyReleaseAllLocks(), after opening the instance > for writes after a promotion. I don't think that's unsafe, but it > would be intrusive. Why would it be intrusive? We're talking a split second here, no? More importantly, I don't think it's correct to release the locks at that point. > Anyway, isn't the issue ExpireAllKnownAssignedTransactionIds() itself, > where we should try to not wipe out the 2PC entries to make sure that > all those snapshots still see the 2PC transactions as something to > count on? I am attaching a crude patch to show the idea. I don't think that's sufficient. We can't do most of the other stuff in ShutdownRecoveryTransactionEnvironment() before changing XLogCtl->SharedRecoveryState either. As long as the other backends think we are in recovery, we shouldn't release e.g. the virtual transaction. Greetings, Andres Freund
Re: CREATE COLLATION - check for duplicate options and error out if found one
On Thu, May 27, 2021 at 8:36 PM vignesh C wrote: > > On Wed, May 26, 2021 at 7:44 PM Bharath Rupireddy > wrote: > > > > On Wed, May 26, 2021 at 7:18 PM vignesh C wrote: > > > +1 for fixing this issue, we have handled this error in other places. > > > The patch does not apply on head, could you rebase the patch on head > > > and post a new patch. > > > > Thanks. I thought of rebasing once the other patch (which reorganizes > > "...specified more than once" error) gets committed. Anyways, I've > > rebased for now on the latest master. Please review v2 patch. > > The test changes look good to me, I liked the idea of rebasing the > source changes once "specified more than once" patch gets committed. I > will review the code changes after that. I'm not sure which patch will go in first. I think both reviews can be finished, and once we see which one the committer picks up first, the other patch can be rebased on top of it. With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
Re: storing an explicit nonce
Hi, On 2021-05-27 12:49:15 -0400, Stephen Frost wrote: > That's not really a reason to rule it out though and Bruce's point about > having a way to get to an encrypted cluster from an unencrypted one is > certainly worth consideration. Naturally, we'd need to document > everything appropriately but there isn't anything saying that we > couldn't, say, have XTS in v15 without any adjustments to the page > layout, accepting that there's no data integrity validation and focusing > just on encryption, and then returning to the question about adding in > data integrity validation for a future version, perhaps using the > special area for a nonce+tag with GCM or maybe something else. Users > who wish to move to a cluster with encryption and data integrity > validation would have to get there through some other means than > replication, but that's going to always be the case because we have to > have space to store the tag, even if we can figure out some other > solution for the nonce. But won't we then end up with a different set of requirements around nonce assignment durability when introducing GCM support? That's not actually entirely trivial to do correctly on a standby. I guess we can use AES-GCM-SIV and be ok with living with edge cases leading to nonce reuse, but ... Greetings, Andres Freund
Re: Reducing the range of OIDs consumed by genbki.pl
On Wed, May 26, 2021 at 5:43 PM Tom Lane wrote: > > The attached patch stems from the conversation at [1]; > I'm starting a new thread to avoid confusing the cfbot. The patch looks good to me. -- John Naylor EDB: http://www.enterprisedb.com
Re: storing an explicit nonce
Greetings, * Andres Freund (and...@anarazel.de) wrote: > On 2021-05-27 12:49:15 -0400, Stephen Frost wrote: > > That's not really a reason to rule it out though and Bruce's point about > > having a way to get to an encrypted cluster from an unencrypted one is > > certainly worth consideration. Naturally, we'd need to document > > everything appropriately but there isn't anything saying that we > > couldn't, say, have XTS in v15 without any adjustments to the page > > layout, accepting that there's no data integrity validation and focusing > > just on encryption, and then returning to the question about adding in > > data integrity validation for a future version, perhaps using the > > special area for a nonce+tag with GCM or maybe something else. Users > > who wish to move to a cluster with encryption and data integrity > > validation would have to get there through some other means than > > replication, but that's going to always be the case because we have to > > have space to store the tag, even if we can figure out some other > > solution for the nonce. > > But won't we then end up with a different set of requirements around > nonce assignment durability when introducing GCM support? That's not > actually entirely trivial to do correctly on a standby. I guess we can > use AES-GCM-SSIV and be ok with living with edge cases leading to nonce > reuse, but ... Not sure if I'm entirely following the question but I would have thought the up-thread idea of generating a random part of the nonce for each start up and then a global counter for the rest, which would be written whenever the page is updated (meaning it wouldn't have anything to do with the LSN and would be stored in the special area as Robert contemplated) would work for both primaries and replicas. 
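[Editor's note: the random-prefix-plus-counter scheme described above can be sketched as below. The structure and names are hypothetical, purely for illustration of the idea: a 96-bit GCM-style nonce whose first bytes are fixed at startup and whose remainder is a shared monotonically increasing counter, so no nonce repeats within one server lifetime.]

```python
import secrets
import threading

class PageNonceGenerator:
    """96-bit nonce = 4 random bytes chosen at startup + 64-bit counter.

    Repeats are impossible within one lifetime; a collision across
    restarts (or across primary and standby) additionally requires the
    independently chosen random prefixes to coincide."""

    def __init__(self) -> None:
        self._prefix = secrets.token_bytes(4)   # regenerated at each startup
        self._counter = 0
        self._lock = threading.Lock()           # one shared counter

    def next_nonce(self) -> bytes:
        with self._lock:
            self._counter += 1
            return self._prefix + self._counter.to_bytes(8, "big")
```

The counter has no relationship to the LSN; it would live in the page's special area as contemplated elsewhere in the thread.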
Taking a step back, while I like the idea of trying to think through these complications in a future world where we add GCM support, if we're actually agreed on seriously looking at XTS for v15 then maybe we should focus on that for the moment. As Bruce says, there are a lot of moving parts in this patch that likely need discussion and agreement in order for us to be able to move forward with it. For starters, we'd probably want to get agreement on what we'd use to construct the tweak. Thanks, Stephen signature.asc Description: PGP signature
Re: Reducing the range of OIDs consumed by genbki.pl
John Naylor writes: > The patch looks good to me. Thanks for reviewing! regards, tom lane
Re: storing an explicit nonce
On Thu, May 27, 2021 at 12:15 PM Bruce Momjian wrote: > > Well, in the design where the nonce is stored in the page, there is no > > need for every hint-type change to appear in the WAL at all. Once per > > checkpoint cycle, you need to write a full page image, as we do for > > checksums or wal_log_hints. The rest of the time, you can just bump > > the nonce and rewrite the page, same as we do today. > > What is it about having the nonce be the LSN that doesn't allow that to > happen? Could we just create a dummy LSN record and assign that to the > page and use that as a nonce. I can't tell which of two possible proposals you are describing here. If the LSN is used to derive the nonce, then one option is to just log a WAL record every time we need a new nonce. As I understand it, that's basically what you've already implemented, and we've discussed the disadvantages of that approach at some length already. The basic problems seem to be: - It's potentially very expensive if page evictions are frequent, which they will be whenever the workload is write-heavy and the working set is larger than shared_buffers. - If there's ever a situation where we need to write a page image different from any page image written previously and we cannot at that time write a WAL record to generate a new LSN for use as the nonce, then the algorithm is broken entirely. Andres's latest post points out - I think correctly - that this happens on standbys, because WAL replay does not generate byte-identical results on standbys even if you ignore hint bits. The first point strikes me as a sufficiently serious performance problem to justify giving up on this design, but that's a judgement call. The second one seems like it breaks it entirely. Now, there's another possible direction that is also suggested by your remarks here: maybe you meant using a fake LSN in cases where we can't use a real one. 
For example, suppose you decide to reserve half of the LSN space - all LSNs with the high bit set, for example - for this purpose. Well, you somehow need to ensure that you never use one of those values more than once, so you might think of putting a counter in shared memory. But now imagine a master with two standbys. How would you avoid having the same counter value used on one standby and also on the other standby? Even if they use the same counter for different pages, it's a critical security flaw. And since those standbys don't even need to know that the other one exists, that seems pretty well impossible to avoid. Now you might ask why we don't have the same problem if we store the nonce in the special space. One difference is that if you store the nonce explicitly, you can allow however much bit space you need in order to guarantee uniqueness, whereas reserving half the LSN space only gives you 63 bits. That's not enough to achieve uniqueness without tight coordination. With 128 bits, you can do things like just generate random values and assume they're vanishingly unlikely to collide, or randomly generate half the value and use the other half as a counter and be pretty safe. With 63 bits you just don't have enough bit space available to reliably avoid collisions using algorithms of that type, due to the birthday paradox. I think it would be adequate for uniqueness if there were a single shared counter and every allocation came from it, but again, as soon as you imagine a master and a bunch of standbys, that breaks down. Also, it's not entirely clear to me that you can avoid needing the LSN space on the page for a real LSN at the same time you also need it for a fake-LSN-being-used-as-a-nonce. We rely on the LSN field containing the LSN of the last WAL record for the page in order to obey the WAL-before-data rule, without which crash recovery will not work reliably. 
Now, if you sometimes have to use that field for a nonce that is a fake LSN, that means you no longer always have a place to store the real LSN. I can't convince myself off-hand that it's completely impossible to work around that problem, but it seems like any attempt to do so would be complicated and fragile at best. I don't think that's a direction that we want to go. Making crash recovery work reliably is a hard problem where we've had lots of bugs despite years of dedicated effort. TDE is also complex and has lots of pitfalls of its own. If we take two things which are individually complicated and hard to get right and intertwine them by making them share bit-space, I think it drives the complexity up to a level where we don't have much hope of getting things right. -- Robert Haas EDB: http://www.enterprisedb.com
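[Editor's note: the birthday-paradox arithmetic above can be checked directly. The draw count below is an illustrative assumption, not a measurement.]

```python
import math

def birthday_collision_probability(draws: float, bits: int) -> float:
    # Standard birthday bound: p ~= 1 - exp(-n^2 / 2^(bits + 1)).
    # expm1 keeps precision when the probability is tiny.
    return -math.expm1(-(draws ** 2) / 2.0 ** (bits + 1))

n = 2.0 ** 32  # ~4 billion independently chosen nonces, e.g. page writes

p_63 = birthday_collision_probability(n, 63)     # 63 bits of LSN space
p_128 = birthday_collision_probability(n, 128)   # 128-bit explicit nonce

# With 63 bits a collision is more likely than not (~0.63); with 128 bits
# it is negligible (~2^-65).
```

This is why uniqueness via random or semi-random assignment is plausible in a 128-bit space but hopeless in a 63-bit one; 63 bits would demand tight global coordination instead.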
Re: storing an explicit nonce
Hi, On 2021-05-27 12:00:03 -0400, Bruce Momjian wrote: > On Wed, May 26, 2021 at 05:02:01PM -0400, Bruce Momjian wrote: > > Rather than surprise anyone, I might as well just come out and say some > > things. First, I have always admitted this feature has limited > > usefulness. > > > > I think a non-LSN nonce adds a lot of code complexity, which adds a code > > and maintenance burden. It also prevents the creation of an encrypted > > replica from a non-encrypted primary using binary replication, which > > makes deployment harder. > > > > Take a feature of limited usefulness, add code complexity and deployment > > difficulty, and the feature becomes even less useful. > > > > For these reasons, if we decide to go in the direction of using a > > non-LSN nonce, I no longer plan to continue working on this feature. I > > would rather work on things that have a more positive impact. Maybe a > > non-LSN nonce is a better long-term plan, but there are too many > > unknowns and complexity for me to feel comfortable with it. > > [...] > I suspect that if we start adding a non-LSN nonce and malicious write > detection, we will end up with the same problem --- a complex patch for > a feature that has limited usefulness, and requires dump/restore or > logical replication to add it to a cluster. I think such a patch would > be rejected, and I would probably even vote against it myself. I think it's diametrically the opposite. Using the LSN as the nonce requires that all code modifying pages needs to be audited (which clearly hasn't been done yet), whereas an independent nonce can be maintained in a few central places. And that's not just a one-off issue, it's a forevermore issue. Greetings, Andres Freund
Re: storing an explicit nonce
On Thu, May 27, 2021 at 12:49 PM Stephen Frost wrote: > Right, if we simply can't solve the nonce-reuse concern then that would > be better. Given the issues that Andres raised about standbys and the treatment of the "hole," I see using the LSN for the nonce as a dead-end. I think it's pretty bad on performance grounds too, for reasons already discussed, but you could always hypothesize that people care so much about security that they will ignore any amount of trouble with performance. You can hardly hypothesize that those same people also won't mind security vulnerabilities that expose tuple data, though. I don't think the idea of storing the nonce at the end of the page is dead. There seem to be some difficulties there, but I think there are reasonable prospects of solving them. At the very least there's the brute-force approach of generating a ton of cryptographically strong random numbers, and there seems to be some possibility of doing better than that. However, I'm pretty excited by this idea of using XTS. Now granted I didn't have the foggiest idea what XTS was before today, but I hear you and Andres saying that we can use that approach without needing a nonce at all. That seems to make a lot of the problems we're talking about here just go away. > > Nor does it provide integrity - which does seem like a significant issue > > going forward. Which does require storing additional per-page data... > > Yeah, this is one of the reasons that I hadn't really been thrilled with > XTS- I've really been looking down the road at eventually having GCM and > having actual integrity validation included. > > That's not really a reason to rule it out though and Bruce's point about > having a way to get to an encrypted cluster from an unencrypted one is > certainly worth consideration. 
Naturally, we'd need to document > everything appropriately but there isn't anything saying that we > couldn't, say, have XTS in v15 without any adjustments to the page > layout, accepting that there's no data integrity validation and focusing > just on encryption, and then returning to the question about adding in > data integrity validation for a future version, perhaps using the > special area for a nonce+tag with GCM or maybe something else. Users > who wish to move to a cluster with encryption and data integrity > validation would have to get there through some other means than > replication, but that's going to always be the case because we have to > have space to store the tag, even if we can figure out some other > solution for the nonce. +1 from me to all of this except the idea of foreclosing present discussion on how data-integrity validation could be made to work. I think it would great to have more discussion of that problem now, in case it informs our decisions about anything else, especially because, based on your earlier remarks, it seems like there is some coupling between the two problems. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Move pg_attribute.attcompression to earlier in struct for reduced size?
On Thu, May 27, 2021 at 10:39 AM Tom Lane wrote: > What I'm hearing is a whole lot of hypothesizing and zero evidence of > actual field requirements. On the other side of the coin, we've already > wasted significant person-hours on fixing this feature's memory leakage, > and now people are proposing to expend more effort on solving^Wpapering > over its performance issues by adding yet more user-visible complication. > It's already adding too much user-visible complication IMO --- I know > because I was just copy-editing the documentation about that yesterday. > > I say it's time to stop the bleeding and rip it out. When and if > there are actual field requests to have a way to do this, we can > discuss what's the best way to respond to those requests. Hacking > VACUUM probably isn't the best answer, anyway. But right now, > we are past feature freeze, and I think we ought to jettison this > one rather than quickly kluge something. Thanks for sharing your thoughts. -1 from me. -- Robert Haas EDB: http://www.enterprisedb.com
Re: storing an explicit nonce
Hi, On 2021-05-27 13:26:11 -0400, Stephen Frost wrote: > * Andres Freund (and...@anarazel.de) wrote: > > On 2021-05-27 12:49:15 -0400, Stephen Frost wrote: > > > That's not really a reason to rule it out though and Bruce's point about > > > having a way to get to an encrypted cluster from an unencrypted one is > > > certainly worth consideration. Naturally, we'd need to document > > > everything appropriately but there isn't anything saying that we > > > couldn't, say, have XTS in v15 without any adjustments to the page > > > layout, accepting that there's no data integrity validation and focusing > > > just on encryption, and then returning to the question about adding in > > > data integrity validation for a future version, perhaps using the > > > special area for a nonce+tag with GCM or maybe something else. Users > > > who wish to move to a cluster with encryption and data integrity > > > validation would have to get there through some other means than > > > replication, but that's going to always be the case because we have to > > > have space to store the tag, even if we can figure out some other > > > solution for the nonce. > > > > But won't we then end up with a different set of requirements around > > nonce assignment durability when introducing GCM support? That's not > > actually entirely trivial to do correctly on a standby. I guess we can > > use AES-GCM-SIV and be ok with living with edge cases leading to nonce > > reuse, but ... > > Not sure if I'm entirely following the question Going for XTS in the short term might end up duplicating a lot of effort if, in the medium term, we then have to solve all the issues around maintaining nonces efficiently and correctly anyway, because we want integrity support. 
> but I would have thought the up-thread idea of generating a random > part of the nonce for each start up and then a global counter for the > rest, which would be written whenever the page is updated (meaning it > wouldn't have anything to do with the LSN and would be stored in the > special area as Robert contemplated) would work for both primaries and > replicas. Yea, it's not a bad approach. Particularly because it removes the need to ensure that "global nonce counter" increments are guaranteed to be durable. > For one, we'd probably want to get agreement on what we'd use to > construct the tweak, for starters. Hm, isn't that just a pg_strong_random() and storing it encrypted? Greetings, Andres Freund
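[Editor's note: the nonce scheme discussed above — a random component drawn once per server start plus a global counter written into the page's special area — can be sketched as follows. The class name and field widths are illustrative, not taken from any patch.]

```python
import os
import struct

class NonceGenerator:
    """Sketch of the scheme discussed above: a random prefix drawn once
    per server startup, plus a monotonically increasing counter.  The
    counter used for a page would be stored in that page's special area,
    so counter increments need not be made durable on their own: after a
    restart the prefix changes, so nonces cannot repeat even if the
    counter resets."""

    def __init__(self, prefix_bytes: int = 8):
        self.prefix = os.urandom(prefix_bytes)  # fresh each startup
        self.counter = 0

    def next_nonce(self) -> bytes:
        self.counter += 1
        # 8-byte random prefix + 8-byte big-endian counter = 16 bytes
        return self.prefix + struct.pack(">Q", self.counter)

gen = NonceGenerator()
n1, n2 = gen.next_nonce(), gen.next_nonce()
assert len(n1) == 16 and n1 != n2      # distinct nonces, same prefix
```

This is only a model of the durability argument: because the per-page counter value travels with the page and the prefix is regenerated at startup, nothing about the counter itself has to be fsync'd.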
Re: Race condition in recovery?
On Wed, May 26, 2021 at 8:49 PM Kyotaro Horiguchi wrote: > So in the mail [1] and [2] I tried to describe what's going on around > the two issues. Although I haven't had a response to [2], can I > think that we clarified the intention of ee994272ca? And may I think > that we decided that we don't add a test for the commit? Regarding the first question, I feel that the intention of ee994272ca is fairly clear at this point. Someone else might feel differently so I won't presume to speak for anyone but me. Regarding the second question, I am not opposed to adding a test for that commit, but I think it is a lot more important to fix the bug we have now than to add a test for a bug that was fixed a long time ago. > Then it seems to me that Robert refound how to reproduce Dilip's case > using basebackup instead of using two standbys. It is using a broken > archive_command with pg_basebackup -Xnone and I showed that the same > resulting state is available by pg_basebackup -Xstream/fetch clearing > pg_wal directory of the resulting backup including an explanation of > why. Yes, it makes sense that we could get to the same state either by not fetching the WAL in the first place, or alternatively by fetching it and then removing it. > *I* think that it is better to avoid having the archive_command since > it seems to me that just unlinking some files seems simpler than having > the broken archive_command. However, since Robert ignored it, I guess > that Robert thinks that the broken archive_command is better than > that. Well ... I don't see those things as quite related. As far as I can see, unlinking files from pg_wal is an alternative to using -Xnone. On the other hand, the broken archive_command is there to make sure the new primary doesn't archive its WAL segment too soon.
Regarding the first point, I think using -Xnone is better than using -Xfetch/stream and then removing the WAL, because (1) it doesn't seem efficient to fetch WAL only to turn around and remove it and (2) someone might question whether removing the WAL afterward is a supported procedure, whereas using an option built into the tool must surely be supported. Regarding the second point, I think using the broken archive command is superior because we can be sure of the behavior. If we just rely on not having crossed a segment boundary, then anything that causes more WAL to be generated than we are expecting could break the test. I don't think it's particularly likely in a case like this that autovacuum or any other thing would kick in and generate extra WAL, but the broken archive command ensures that even if it does happen, the test will still work as intended. That, to me, seems like a good enough reason to do it that way. > FWIW, regarding the name of the test script, putting aside what it > actually does, I proposed to place it as a part of > 004_timeline_switch.pl because this issue is related to timeline > switching. I think it is better to keep it separate. Long test scripts that test multiple things with completely separate tests are hard to read. Kyotaro-san, I hope I have not given any offense. I am doing my best, and certainly did not mean to be rude. -- Robert Haas EDB: http://www.enterprisedb.com
Re: storing an explicit nonce
Greetings, * Andres Freund (and...@anarazel.de) wrote: > On 2021-05-27 13:26:11 -0400, Stephen Frost wrote: > > * Andres Freund (and...@anarazel.de) wrote: > > > On 2021-05-27 12:49:15 -0400, Stephen Frost wrote: > > > > That's not really a reason to rule it out though and Bruce's point about > > > > having a way to get to an encrypted cluster from an unencrypted one is > > > > certainly worth consideration. Naturally, we'd need to document > > > > everything appropriately but there isn't anything saying that we > > > > couldn't, say, have XTS in v15 without any adjustments to the page > > > > layout, accepting that there's no data integrity validation and focusing > > > > just on encryption, and then returning to the question about adding in > > > > data integrity validation for a future version, perhaps using the > > > > special area for a nonce+tag with GCM or maybe something else. Users > > > > who wish to move to a cluster with encryption and data integrity > > > > validation would have to get there through some other means than > > > > replication, but that's going to always be the case because we have to > > > > have space to store the tag, even if we can figure out some other > > > > solution for the nonce. > > > > > > But won't we then end up with a different set of requirements around > > > nonce assignment durability when introducing GCM support? That's not > > > actually entirely trivial to do correctly on a standby. I guess we can > > > use AES-GCM-SSIV and be ok with living with edge cases leading to nonce > > > reuse, but ... > > > > Not sure if I'm entirely following the question > > It seems like it might end up with lots of duplicated effort to go for > XTS in the short term, if we medium term then have to solve all the > issues around how to maintain efficiently and correctly nonces anyway, > because we want integrity support. 
You and Robert both seem to be going in that direction, one which I tend to share, while Bruce is very hard set against it, from the perspective that he doesn't view integrity as important (I disagree quite strongly with that; even if we can't protect everything, I see it as certainly valuable to protect the primary data) and that this approach adds complexity (the amount of which doesn't seem to be agreed upon). I'm also not sure how much of the effort would really be duplicated. Were we to start with XTS, that's almost drop-in with what Bruce has (actually, it should simplify some parts since we no longer need to deal with making sure we always increase the LSN, etc), gives users more flexibility in terms of getting to an encrypted cluster, and solves certain use-cases. Very little of that seems like it would be ripped out if we were to (also) provide a GCM option. Now, if we were to *only* provide a GCM option then maybe we wouldn't need to think about the XTS case of having to come up with a tweak (though that seems like a rather small amount of code), but that would also mean we need to change the page format and we can't do any kind of binary/page-level transition to an encrypted cluster, like we could with XTS. Trying to break it down, the end-goal states look like:

GCM-only: no binary upgrade path due to having to store the tag
XTS-only: no data integrity option
GCM+XTS: binary upgrade path for XTS, data integrity with GCM

If we want both a binary upgrade path and a data integrity option, then it seems like the only end state which provides both is GCM+XTS, in which case I don't think there's a lot of actual duplication.
Perhaps there's an "XTS + some other data integrity approach" option where we could preserve the page format by stuffing information into another fork or maybe telling users to hash their data and store that hash as another column which would allow us to avoid implementing GCM, but I don't see a way to avoid having XTS if we are going to provide a binary upgrade path. Perhaps AES-GCM-SIV would be interesting to consider in general, but that still means we need to find space for the tag and that still precludes a binary upgrade path. > > but I would have thought the up-thread idea of generating a random > > part of the nonce for each start up and then a global counter for the > > rest, which would be written whenever the page is updated (meaning it > > wouldn't have anything to do with the LSN and would be stored in the > > special area as Robert contemplated) would work for both primaries and > > replicas. > > Yea, it's not a bad approach. Particularly because it removes the need > to ensure that "global nonce counter" increments are guaranteed to be > durable. Right. > > For one, we'd probably want to get agreement on what we'd use to > > construct the tweak, for starters. > > Hm, isn't that just a pg_strong_random() and storing it encrypted? Perhaps it is, but at least in some other cases it's generated based on sector and block (which maybe could be relfilenode and block for us?): https://medium.com/asecuritysite-whe
Re: Move pg_attribute.attcompression to earlier in struct for reduced size?
On 2021-May-27, Tom Lane wrote: > I say it's time to stop the bleeding and rip it out. When and if > there are actual field requests to have a way to do this, we can > discuss what's the best way to respond to those requests. Hacking > VACUUM probably isn't the best answer, anyway. But right now, > we are past feature freeze, and I think we ought to jettison this > one rather than quickly kluge something. Sorry, I'm unclear on exactly what you are proposing. Are you proposing to rip out the fact that VACUUM FULL promises to recompress everything, or are you proposing to rip out the whole attcompression feature? Absolute -1 on the latter from me. Pluggable compression has taken years to get to this point; it certainly won't do to give that up. Now about the former. If we do think that recompressing causes an unacceptable 10% slowdown for every single VACUUM FULL, then yeah we should discuss changing that behavior -- maybe remove promises of recompression and wait for pg15 to add "VACUUM (RECOMPRESS)" or similar. If it's a 10% slowdown of only the best times (variability unspecified) and only in corner cases (unlogged tables with no indexes that fit in shared buffers), then I don't think we should bother. -- Álvaro Herrera 39°49'30"S 73°17'W "You're _really_ hosed if the person doing the hiring doesn't understand relational systems: you end up with a whole raft of programmers, none of whom has had a Date with the clue stick." (Andrew Sullivan)
Re: storing an explicit nonce
On Thu, May 27, 2021 at 1:07 PM Andres Freund wrote: > But won't we then end up with a different set of requirements around > nonce assignment durability when introducing GCM support? That's not > actually entirely trivial to do correctly on a standby. I guess we can > use AES-GCM-SSIV and be ok with living with edge cases leading to nonce > reuse, but ... All these different encryption modes are hard for me to grok. That said, I want to mention a point which I think may be relevant here. As far as I know, in the case of a permanent table page, we never write X then X' then X again. If the change is WAL-logged, then the LSN advances, and it will never thereafter go backward. Otherwise, it's something that uses MarkBufferDirtyHint(). As far as I know, all of those changes are one-way. For example, we set hint bits without logging the change, but anything that clears hint bits is logged. We mark btree index items dead as a type of hint, but they never come back to life again; instead, they get cleaned out of the page entirely as a WAL-logged operation. So I don't know that an adversary seeing the same exact ciphertext multiple times is really likely to occur. Well, it could certainly occur for temporary or unlogged tables, since those have LSN = 0. And in cases where we currently copy pages around, like creating a new database, it could happen. I suspect those cases could be fixed, if we cared enough, and there are independent reasons to want to fix the create-new-database case. It would be fairly easy to put fake LSNs in temporary buffers, since they're in a separate pool of buffers in backend-private memory with a separate buffer manager. And it could probably even be done for unlogged tables, though not as easily. Or we could use the special-space technique to put some unpredictable garbage into each page and then change the garbage every time we write the page. 
I read the discussion so far to say that maybe these kinds of measures aren't even needed, and if so, great. But even without doing anything, I don't think it's going to happen very much. Another case where this sort of thing might happen is a standby doing whatever the master did. I suppose that could be avoided if the standby always has its own encryption keys, but that forces a key rotation when you create a standby, and it doesn't seem like a lot of fun to insist on that. But the information leak seems minor. If we get to a point where an adversary with full filesystem access on all our systems can't do better than assessing our replication lag, we'll be a lot better off then than we are now. -- Robert Haas EDB: http://www.enterprisedb.com
Re: storing an explicit nonce
Hi, On 2021-05-27 15:22:21 -0400, Stephen Frost wrote: > I'm also not sure how much of the effort would really be duplicated. > > Were we to start with XTS, that's almost drop-in with what Bruce has > (actually, it should simplify some parts since we no longer need to deal > with making sure we always increase the LSN, etc) gives users more > flexibility in terms of getting to an encrypted cluster and solves > certain use-cases. Very little of that seems like it would be ripped > out if we were to (also) provide a GCM option. > Now, if we were to *only* provide a GCM option then maybe we wouldn't > need to think about the XTS case of having to come up with a tweak > (though that seems like a rather small amount of code) but that would > also mean we need to change the page format and we can't do any kind of > binary/page-level transistion to an encrypted cluster, like we could > with XTS. > Trying to break it down, the end-goal states look like: > > GCM-only: no binary upgrade path due to having to store the tag > XTS-only: no data integrity option > GCM+XTS: binary upgrade path for XTS, data integrity with GCM Why would GCM + XTS make sense? Especially if we were to go with AES-GCM-SIV or something, drastically reducing the danger of nonce reuse? And I don't think there's an easy way to do both using openssl, without double encrypting, which we'd obviously not want for performance reasons. And I don't think we'd want to implement either ourselves - leaving other dangers aside, I don't think we want to do the optimization work necessary to get good performance. > If we want both a binary upgrade path, and a data integrity option, then > it seems like the only end state which provides both is GCM+XTS, in > which case I don't think there's a lot of actual duplication. I honestly feel that Bruce's point about trying to shoot for the moon, and thus not getting the basic feature done, applies much more to the binary upgrade path than anything else. 
I think we should just stop aiming for that for now. If we later want to add code that goes through the cluster to ensure that there's enough space on each page for integrity data, to provide a migration path, fine. But we shouldn't make the binary upgrade path for TDE a hard requirement. > > > For one, we'd probably want to get agreement on what we'd use to > > > construct the tweak, for starters. > > > > Hm, isn't that just a pg_strong_random() and storing it encrypted? > > Perhaps it is, but at least in some other cases it's generated based on > sector and block (which maybe could be relfilenode and block for us?): > > https://medium.com/asecuritysite-when-bob-met-alice/who-needs-a-tweak-meet-full-disk-encryption-437e720879ac My understanding is that you'd use tweak_secret + block_offset or someop(tweak_secret, relfilenode) + block_offset to generate the actual per-block (in the 8192 byte, not 128bit sense) tweak. Greetings, Andres Freund
Re: storing an explicit nonce
On Thu, May 27, 2021 at 3:22 PM Stephen Frost wrote: > Trying to break it down, the end-goal states look like: > > GCM-only: no binary upgrade path due to having to store the tag > XTS-only: no data integrity option > GCM+XTS: binary upgrade path for XTS, data integrity with GCM > > If we want both a binary upgrade path, and a data integrity option, then > it seems like the only end state which provides both is GCM+XTS, in > which case I don't think there's a lot of actual duplication. > > Perhaps there's an "XTS + some other data integrity approach" option > where we could preserve the page format by stuffing information into > another fork or maybe telling users to hash their data and store that > hash as another column which would allow us to avoid implementing GCM, > but I don't see a way to avoid having XTS if we are going to provide a > binary upgrade path. > > Perhaps AES-GCM-SIV would be interesting to consider in general, but > that still means we need to find space for the tag and that still > precludes a binary upgrade path. Anything that decouples features without otherwise losing ground is a win. If there are things A and B, such that A does encryption and B does integrity validation, and A and B can be turned on and off independently of each other, that is better than some otherwise-comparable C that provides both features. But I'm going to have to defer to you and Andres and whoever else on whether that's true for any encryption methods/modes in particular. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Move pg_attribute.attcompression to earlier in struct for reduced size?
On Thu, May 27, 2021 at 7:25 AM Robert Haas wrote: > TBH, I'm more concerned about the other direction. Surely someone who > upgrades from an existing release to v14 and sets their compression > method to lz4 is going to want a way of actually converting their data > to using lz4. Your argument would be more convincing (at least to me) if we really did expect users to want to pick and choose, based on natural variations in datasets that make switching to *either* potentially yield a real benefit. It is my understanding that lz4 is pretty much superior to pglz by every relevant measure, though, so I'm not sure that that argument can be made. At the same time, users tend to only care specifically about things that are real step changes -- which I don't think this qualifies as. Users will go out of their way to get one of those, but otherwise won't bother. Perhaps there is a practical argument in favor of VACUUM FULL reliably recompressing using lz4 on upgrade, where that's the user's stated preference. It's not self-evident that VACUUM FULL must or even should do that, at least to me. I'm not suggesting that there must not be such an argument. Just that I don't think that anybody has made such an argument. > To say that nobody cares about that is to deem the > feature useless. Maybe that's what Tom thinks, but it's not what I > think. I don't think that follows at all. -- Peter Geoghegan
Re: storing an explicit nonce
Hi, On 2021-05-27 15:48:09 -0400, Robert Haas wrote: > That said, I want to mention a point which I think may be relevant > here. As far as I know, in the case of a permanent table page, we > never write X then X' then X again. Well, there's crash recovery / restarts. And as previously explained they can end up with different page contents than before. > And in cases where we currently copy pages around, like creating a new > database, it could happen. As long as it's identical data that should be fine, except leaking that the data is identical. Which doesn't make me really concerned in the case of template databases. > I suspect those cases could be fixed, if we cared enough, and there > are independent reasons to want to fix the create-new-database > case. It would be fairly easy to put fake LSNs in temporary buffers, > since they're in a separate pool of buffers in backend-private memory > with a separate buffer manager. And it could probably even be done for > unlogged tables, though not as easily. [...] I read > the discussion so far to say that maybe these kinds of measures aren't > even needed, and if so, great. But even without doing anything, I > don't think it's going to happen very much. What precisely are you referring to with "aren't even needed"? I don't see how the fake LSN approach can work for the crash recovery issues? > Or we could use the special-space technique to put some unpredictable > garbage into each page and then change the garbage every time we write > the page Unfortunately with CTR mode that doesn't provide much protection, if it's part of the encrypted data (vs IV/nonce). A one bit change in the encrypted data only changes one bit in the unencrypted data, as the data is just XORd with the cipher stream. So random changes in one place don't prevent disclosure in other parts of the data if the nonce doesn't also change. And one can easily predict the effect of flipping certain bits. 
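[Editor's note: the XOR malleability of CTR mode described above can be demonstrated in a few lines. The keystream here is simulated with random bytes rather than derived from AES-CTR, but the bit-flipping property being discussed is identical.]

```python
import os

def ctr_like_encrypt(keystream: bytes, data: bytes) -> bytes:
    # CTR mode reduces to XORing the data with a keystream; that XOR
    # is all we model here (decryption is the same operation).
    return bytes(d ^ k for d, k in zip(data, keystream))

keystream = os.urandom(16)
plaintext = b"balance: 00100\n\n"
ciphertext = ctr_like_encrypt(keystream, plaintext)

# An attacker who never sees the key flips one ciphertext bit ...
tampered = bytearray(ciphertext)
tampered[9] ^= 0x01                     # low bit of byte 9

# ... and exactly that plaintext bit flips, predictably.
decrypted = ctr_like_encrypt(keystream, bytes(tampered))
assert decrypted[9] == plaintext[9] ^ 0x01
assert decrypted[:9] == plaintext[:9]   # all other bytes untouched
```

This is why random garbage that is merely part of the CTR-encrypted payload (rather than feeding into the IV/nonce) does not hide or protect the rest of the page.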
> Another case where this sort of thing might happen is a standby doing > whatever the master did. I suppose that could be avoided if the > standby always has its own encryption keys, but that forces a key > rotation when you create a standby, and it doesn't seem like a lot of > fun to insist on that. But the information leak seems minor. Which leaks seem minor? The "hole" issues leak all the prior contents of the hole, without needing any complicated analysis of the data, because one plain text is known (zeroes). Greetings, Andres Freund
Re: storing an explicit nonce
Greetings, * Andres Freund (and...@anarazel.de) wrote: > On 2021-05-27 15:22:21 -0400, Stephen Frost wrote: > > I'm also not sure how much of the effort would really be duplicated. > > > > Were we to start with XTS, that's almost drop-in with what Bruce has > > (actually, it should simplify some parts since we no longer need to deal > > with making sure we always increase the LSN, etc) gives users more > > flexibility in terms of getting to an encrypted cluster and solves > > certain use-cases. Very little of that seems like it would be ripped > > out if we were to (also) provide a GCM option. > > > Now, if we were to *only* provide a GCM option then maybe we wouldn't > > need to think about the XTS case of having to come up with a tweak > > (though that seems like a rather small amount of code) but that would > > also mean we need to change the page format and we can't do any kind of > > binary/page-level transistion to an encrypted cluster, like we could > > with XTS. > > > Trying to break it down, the end-goal states look like: > > > > GCM-only: no binary upgrade path due to having to store the tag > > XTS-only: no data integrity option > > GCM+XTS: binary upgrade path for XTS, data integrity with GCM > > Why would GCM + XTS make sense? Especially if we were to go with > AES-GCM-SIV or something, drastically reducing the danger of nonce > reuse? You can't get to a GCM-based solution without changing the page format and therefore you can't get there using streaming replication or a pg_upgrade that does an encrypt step along with the copy. > And I don't think there's an easy way to do both using openssl, without > double encrypting, which we'd obviously not want for performance > reasons. And I don't think we'd want to implement either ourselves - > leaving other dangers aside, I don't think we want to do the > optimization work necessary to get good performance. 
Eh, clearly a misunderstanding here - what I'm suggesting is that we'd have initdb options where someone could initdb and say they want XTS, OR they could initdb and say they want AES-GCM (or maybe AES-GCM-SIV). I'm not talking about doing both in the cluster at the same time. Or, with XTS, we could have an option to pg_basebackup + encrypt into XTS to build an encrypted replica from an unencrypted cluster. There isn't any way we could do that with GCM though since we wouldn't have any place to put the tag. > > If we want both a binary upgrade path, and a data integrity option, then > > it seems like the only end state which provides both is GCM+XTS, in > > which case I don't think there's a lot of actual duplication. > > I honestly feel that Bruce's point about trying to shoot for the moon, > and thus not getting the basic feature done, applies much more to the > binary upgrade path than anything else. I think we should just stop > aiming for that for now. If we later want to add code that goes through > the cluster to ensure that there's enough space on each page for > integrity data, to provide a migration path, fine. But we shouldn't make > the binary upgrade path for TDE a hard requirement. Ok, that's a pretty clear fundamental disagreement between you and Bruce. For my 2c, I tend to agree with you that the binary upgrade path isn't that critical. If we agree to forgo the binary upgrade requirement and are willing to accept Robert's approach to use the special area for the nonce+tag, or similar, then we could perhaps avoid the work of supporting XTS. > > > > For one, we'd probably want to get agreement on what we'd use to > > > > construct the tweak, for starters. > > > > > > Hm, isn't that just a pg_strong_random() and storing it encrypted? 
> > > > Perhaps it is, but at least in some other cases it's generated based on > > sector and block (which maybe could be relfilenode and block for us?): > > > > https://medium.com/asecuritysite-when-bob-met-alice/who-needs-a-tweak-meet-full-disk-encryption-437e720879ac > > My understanding is that you'd use > tweak_secret + block_offset > or > someop(tweak_secret, relfilenode) block_offset > > to generate the actual per-block (in the 8192 byte, not 128bit sense) tweak. The above article, at least, suggested encrypting the sector number using the second key and then multiplying that times 2^(block number), where those blocks were actually AES 128bit blocks. The article further claims that this is what's used in things like Bitlocker, TrueCrypt, VeraCrypt and OpenSSL. While the documentation isn't super clear, I'm taking that to mean that when you actually use EVP_aes_128_xts() in OpenSSL, and you provide it with a 256-bit key (twice the size of the AES key length function), and you give it a 'tweak', that what you would actually be passing in would be the "sector number" in the above method, or for us perhaps it would be relfilenode+block number, or maybe just block number but it seems like it'd be better to include the relfilenode to me. OpenSSL docs: https://www.open
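[Editor's note: the tweak schedule described above — encrypt the sector number under the second key, then multiply by 2^(block number) — can be sketched as follows. Only the multiply-by-alpha step is shown; the initial tweak, which a real implementation would produce by AES-encrypting the sector/block address under the second half of the XTS key, is stubbed out here.]

```python
def xts_mult_alpha(tweak: bytes) -> bytes:
    """Multiply a 16-byte tweak by alpha (i.e. by x) in GF(2^128),
    using the XTS polynomial x^128 + x^7 + x^2 + x + 1 and the
    little-endian convention of IEEE P1619.  Applied once per
    successive 16-byte AES block within a sector."""
    t = int.from_bytes(tweak, "little") << 1
    if t >> 128:                         # carry out of the top bit:
        t = (t & ((1 << 128) - 1)) ^ 0x87  # reduce by the polynomial
    return t.to_bytes(16, "little")

# Stub standing in for E_K2(sector_number).
initial_tweak = bytes([1] + [0] * 15)

# Tweaks for the first few 16-byte AES blocks of one sector/page:
tweaks = [initial_tweak]
for _ in range(3):
    tweaks.append(xts_mult_alpha(tweaks[-1]))
assert len(set(tweaks)) == 4   # each AES block gets a distinct tweak
```

So when EVP_aes_128_xts() is handed a single "tweak" value, the per-128-bit-block variation within the 8 kB page falls out of this doubling schedule; only the per-sector (per-page) value has to be supplied by the caller.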
Re: Move pg_attribute.attcompression to earlier in struct for reduced size?
Alvaro Herrera writes: > Sorry, I'm unclear on exactly what are you proposing. Are you proposing > to rip out the fact that VACUUM FULL promises to recompress everything, > or are you proposing to rip out the whole attcompression feature? Just the former. regards, tom lane
Re: storing an explicit nonce
On 2021-May-27, Andres Freund wrote: > On 2021-05-27 15:48:09 -0400, Robert Haas wrote: > > Another case where this sort of thing might happen is a standby doing > > whatever the master did. I suppose that could be avoided if the > > standby always has its own encryption keys, but that forces a key > > rotation when you create a standby, and it doesn't seem like a lot of > > fun to insist on that. But the information leak seems minor. > > Which leaks seem minor? The "hole" issues leak all the prior contents of > the hole, without needing any complicated analysis of the data, because > one plain text is known (zeroes). Maybe that problem could be solved by having PageRepairFragmentation, compactify_tuples et al always fill the hole with zeroes, in encrypted databases. -- Álvaro Herrera Valdivia, Chile
Re: Move pg_attribute.attcompression to earlier in struct for reduced size?
Alvaro Herrera writes: > Now about the former. If we do think that recompressing causes an > unacceptable 10% slowdown for every single VACUUM FULLs, then yeah we > should discuss changing that behavior -- maybe remove promises of > recompression and wait for pg15 to add "VACUUM (RECOMPRESS)" or > similar. > If it's a 10% slowdown of the only best times (variability unspecified) > and only in corner cases (unlogged tables with no indexes that fit in > shared buffers), then I don't think we should bother. BTW, perhaps I should clarify my goal here: it's to cut off expending further effort on this feature during v14. If we can decide that the existing performance situation is acceptable, I'm content with that decision. But if we're to start designing new user-visible behavior to satisfy performance objections, then I'd prefer to remove this VACUUM behavior altogether for now. regards, tom lane
Re: storing an explicit nonce
Hi, On 2021-05-27 16:09:13 -0400, Stephen Frost wrote: > * Andres Freund (and...@anarazel.de) wrote: > > On 2021-05-27 15:22:21 -0400, Stephen Frost wrote: > > > I'm also not sure how much of the effort would really be duplicated. > > > > > > Were we to start with XTS, that's almost drop-in with what Bruce has > > > (actually, it should simplify some parts since we no longer need to deal > > > with making sure we always increase the LSN, etc) gives users more > > > flexibility in terms of getting to an encrypted cluster and solves > > > certain use-cases. Very little of that seems like it would be ripped > > > out if we were to (also) provide a GCM option. > > > > > Now, if we were to *only* provide a GCM option then maybe we wouldn't > > > need to think about the XTS case of having to come up with a tweak > > > (though that seems like a rather small amount of code) but that would > > > also mean we need to change the page format and we can't do any kind of > > > binary/page-level transistion to an encrypted cluster, like we could > > > with XTS. > > > > > Trying to break it down, the end-goal states look like: > > > > > > GCM-only: no binary upgrade path due to having to store the tag > > > XTS-only: no data integrity option > > > GCM+XTS: binary upgrade path for XTS, data integrity with GCM > > > [...] > > And I don't think there's an easy way to do both using openssl, without > > double encrypting, which we'd obviously not want for performance > > reasons. And I don't think we'd want to implement either ourselves - > > leaving other dangers aside, I don't think we want to do the > > optimization work necessary to get good performance. > > E, clearly a misunderstanding here- what I'm suggesting is that we'd > have initdb options where someone could initdb and say they want XTS, OR > they could initdb and say they want AES-GCM (or maybe AES-GCM-SIV). I'm > not talking about doing both in the cluster at the same time.. Ah, that makes more sense ;). 
So the end goal states are the different paths we could take? > Still leaves us with the question of what exactly we should pass into > OpenSSL as the 'tweak', if it should be the block offset inside the > file only, or the block offset + relfilenode, or something else. I think it has to include the relfilenode as a minimum. It'd not be great if you could identify equivalent blocks in different tables. It might even be worth complicating createdb() a bit and including the dboid as well. Greetings, Andres Freund
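[Editor's note: a hypothetical per-page tweak input combining the identifiers mentioned above - database OID, relfilenode, and block number - might be packed like this. The 4+4+8 field layout is purely illustrative, chosen only to fill the 16 bytes an XTS tweak expects.]

```python
import struct

def page_tweak(dboid: int, relfilenode: int, blockno: int) -> bytes:
    """Hypothetical 16-byte XTS tweak input.  Including all three
    identifiers ensures that identical blocks in different relations
    (or different databases) never encrypt under the same tweak, so
    equivalent ciphertext blocks cannot be matched across tables."""
    return struct.pack(">IIQ", dboid, relfilenode, blockno)

# Same page contents, different relations => different tweaks:
assert page_tweak(5, 16384, 0) != page_tweak(5, 16385, 0)
```

Whether the database OID belongs in there (given what createdb() currently does with copied files) is exactly the open question in the message above.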
Re: storing an explicit nonce
Hi, On 2021-05-27 16:13:44 -0400, Alvaro Herrera wrote: > Maybe that problem could be solved by having PageRepairFragmentation, > compactify_tuples et al always fill the hole with zeroes, in encrypted > databases. If that were the only issue, maybe. But there's plenty other places were similar things happen. Look at all the stuff that needs to be masked out for wal consistency checking (checkXLogConsistency() + all the things it calls). And there's no way proposed to actually have a maintainable way of detecting omissions around this. Greetings, Andres Freund
Re: Move pg_attribute.attcompression to earlier in struct for reduced size?
Peter Geoghegan writes: > On Thu, May 27, 2021 at 7:25 AM Robert Haas wrote: >> To say that nobody cares about that is to deem the >> feature useless. Maybe that's what Tom thinks, but it's not what I >> think. > I don't think that follows at all. Yeah. My belief here is that users might bother to change default_toast_compression, or that we might do it for them in a few years, but the gains from doing so are going to be only incremental. That being the case, most DBAs will be content to allow the older compression method to age out of their databases through routine row updates. The idea that somebody is going to be excited enough about this to run a downtime-inducing VACUUM FULL doesn't really pass the smell test. That doesn't make LZ4 compression useless, by any means, but it does suggest that we shouldn't be adding overhead to VACUUM FULL for the purpose of easing instantaneous switchovers. I'll refrain from re-telling old war stories about JPEG/GIF/PNG, but I do have real-world experience with compression algorithm changes. IME you need an integer-multiples type of improvement to get peoples' attention, and LZ4 isn't going to offer that, except maybe in cherry-picked examples. regards, tom lane
Re: Support for NSS as a libpq TLS backend
> On 25 Mar 2021, at 00:56, Jacob Champion wrote: > Databases that are opened *after* the first one are given their own separate > slots. Any certificates that are part of those databases seemingly can't be > referenced directly by nickname. They have to be prefixed by their token name > -- a name which you don't have if you used NSS_InitContext() to create the > database. You have to use SECMOD_OpenUserDB() instead. This explains some > strange failures I was seeing in local testing, where the order of > InitContext determined whether our client certificate selection succeeded or > failed. Sorry for the latency in responding, but I'm now back from parental leave. AFAICT the token name for the database can be set with the dbTokenDescription member in the NSSInitParameters struct passed to NSS_InitContext() (documented in nss.h). Using this we can avoid the messier SECMOD machinery and use the token in the auth callback to refer to the database we loaded. I hacked this up in my local tree (rebased patchset coming soon) and it seems to work as intended. -- Daniel Gustafsson https://vmware.com/
Re: storing an explicit nonce
Greetings, * Andres Freund (and...@anarazel.de) wrote: > On 2021-05-27 16:09:13 -0400, Stephen Frost wrote: > > * Andres Freund (and...@anarazel.de) wrote: > > > On 2021-05-27 15:22:21 -0400, Stephen Frost wrote: > > > > I'm also not sure how much of the effort would really be duplicated. > > > > > > > > Were we to start with XTS, that's almost drop-in with what Bruce has > > > > (actually, it should simplify some parts since we no longer need to deal > > > > with making sure we always increase the LSN, etc) gives users more > > > > flexibility in terms of getting to an encrypted cluster and solves > > > > certain use-cases. Very little of that seems like it would be ripped > > > > out if we were to (also) provide a GCM option. > > > > > > > Now, if we were to *only* provide a GCM option then maybe we wouldn't > > > > need to think about the XTS case of having to come up with a tweak > > > > (though that seems like a rather small amount of code) but that would > > > > also mean we need to change the page format and we can't do any kind of > > > > binary/page-level transition to an encrypted cluster, like we could > > > > with XTS. > > > > > > > Trying to break it down, the end-goal states look like: > > > > > > > > GCM-only: no binary upgrade path due to having to store the tag > > > > XTS-only: no data integrity option > > > > GCM+XTS: binary upgrade path for XTS, data integrity with GCM > > > [...] > > > And I don't think there's an easy way to do both using openssl, without > > > double encrypting, which we'd obviously not want for performance > > > reasons. And I don't think we'd want to implement either ourselves - > > > leaving other dangers aside, I don't think we want to do the > > > optimization work necessary to get good performance. 
> > > > Eh, clearly a misunderstanding here- what I'm suggesting is that we'd > > have initdb options where someone could initdb and say they want XTS, OR > > they could initdb and say they want AES-GCM (or maybe AES-GCM-SIV). I'm > > not talking about doing both in the cluster at the same time.. > > Ah, that makes more sense ;). So the end goal states are the different > paths we could take? The end goals are different possible things we could provide support for, not in one cluster, but in one build of PG. That is, we could add support in v15 (or whatever) for: initdb --encryption-type=AES-XTS and then in v16 add support for: initdb --encryption-type=AES-GCM (or AES-GCM-SIV, whatever) while keeping support for AES-XTS. Users who just want encryption could go do a binary upgrade of some kind to a cluster which has AES-XTS encryption, but to get GCM they'd have to initialize a new cluster and migrate data to it using logical replication or pg_dump/restore. There have also been requests for other possible encryption options, so I don't think these would even be the only options eventually, though I do think we'd probably have them broken down into "just encryption" or "encryption + data integrity" with the same resulting limitations regarding the ability to do binary upgrades. > > Still leaves us with the question of what exactly we should pass into > OpenSSL as the 'tweak', if it should be the block offset inside the > file only, or the block offset + relfilenode, or something else. > > I think it has to include the relfilenode as a minimum. It'd not be > great if you could identify equivalent blocks in different tables. It > might even be worth complicating createdb() a bit and including the > dboid as well. At this point I'm wondering if it's just: dboid/relfilenode:block-offset and then we hash it to whatever size EVP_CIPHER_iv_length(AES-XTS-128) (or -256, whatever we're using based on what was passed to initdb) returns. 
Thanks, Stephen
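Stephen's closing idea, hashing a "dboid/relfilenode:block-offset" identifier down to the cipher's tweak length, can be sketched in a few lines. This is an illustrative model, not PostgreSQL code: the identifier format and the use of SHA-256 truncated to 16 bytes (the tweak size OpenSSL reports for AES-XTS) are assumptions made for the sketch.

```python
import hashlib

XTS_TWEAK_LEN = 16  # bytes; what EVP_CIPHER_iv_length() reports for AES-XTS


def xts_tweak(dboid: int, relfilenode: int, block: int) -> bytes:
    """Derive a fixed-size XTS tweak from a block's identity.

    Hashes a "dboid/relfilenode:block-offset" identifier and truncates the
    digest to the tweak size, so equivalent blocks in different relations
    (or different databases) never share a tweak.
    """
    ident = f"{dboid}/{relfilenode}:{block}".encode("ascii")
    return hashlib.sha256(ident).digest()[:XTS_TWEAK_LEN]
```

Because the relfilenode and dboid feed into the hash, the same block offset in two different relations yields different tweaks, which addresses Andres' concern about identifying equivalent blocks across tables.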
Re: storing an explicit nonce
On Thu, May 27, 2021 at 4:04 PM Andres Freund wrote: > On 2021-05-27 15:48:09 -0400, Robert Haas wrote: > > That said, I want to mention a point which I think may be relevant > > here. As far as I know, in the case of a permanent table page, we > > never write X then X' then X again. > > Well, there's crash recovery / restarts. And as previously explained > they can end up with different page contents than before. Right, I'm not trying to oversell this point ... if in system XYZ there's a serious security exposure from ever repeating a page write, we should not use system XYZ unless we do some work to make sure that every page write is different. But if we just think it would be nicer if page writes didn't repeat, that's probably *mostly* true today already. > I don't see how the fake LSN approach can work for the crash recovery > issues? I wasn't trying to say it could. You've convinced me on that point. > > Or we could use the special-space technique to put some unpredictable > > garbage into each page and then change the garbage every time we write > > the page > > Unfortunately with CTR mode that doesn't provide much protection, if > it's part of the encrypted data (vs IV/nonce). A one bit change in the > encrypted data only changes one bit in the unencrypted data, as the data > is just XORd with the cipher stream. So random changes in one place > doesn't prevent disclosure in other parts of the data if the nonce > doesn't also change. And one can easily predict the effect of flipping > certain bits. Yeah, I wasn't talking about CTR mode there. I was just saying if we wanted to avoid ever repeating a write. > > Another case where this sort of thing might happen is a standby doing > > whatever the master did. I suppose that could be avoided if the > > standby always has its own encryption keys, but that forces a key > > rotation when you create a standby, and it doesn't seem like a lot of > > fun to insist on that. But the information leak seems minor. 
> > Which leaks seem minor? The "hole" issues leak all the prior contents of > the hole, without needing any complicated analysis of the data, because > one plain text is known (zeroes). No. You're confusing what I was saying here, in the context of your comments about the limitations of AES-GCM-SIV, with the discussion with Bruce about nonce generation. -- Robert Haas EDB: http://www.enterprisedb.com
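The CTR malleability Andres describes above (a one-bit change in the ciphertext flips exactly the corresponding plaintext bit) is easy to demonstrate without a crypto library, because CTR encryption is just an XOR with a keystream. In this sketch, random bytes stand in for the AES-CTR keystream; that substitution is the only assumption, since the XOR structure is identical.

```python
import os


def ctr_xor(data: bytes, keystream: bytes) -> bytes:
    # CTR mode reduces to XORing the data with the cipher's keystream;
    # the same operation both encrypts and decrypts.
    return bytes(d ^ k for d, k in zip(data, keystream))


plaintext = b"pay $100 to alice"
keystream = os.urandom(len(plaintext))   # stand-in for an AES-CTR keystream
ciphertext = ctr_xor(plaintext, keystream)

# An attacker who can guess the plaintext at some offset can rewrite it
# without knowing the key: flip exactly the bits that differ.
tampered = bytearray(ciphertext)
tampered[5] ^= ord("1") ^ ord("9")       # turn the known '1' into a '9'

recovered = ctr_xor(bytes(tampered), keystream)
# recovered is now b"pay $900 to alice", with no other byte disturbed
```

This is also why a known plaintext region, such as a zero-filled hole, under a reused keystream immediately discloses that part of the keystream, and why unauthenticated CTR offers no protection against targeted bit-flips.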
Re: storing an explicit nonce
On Thu, May 27, 2021 at 04:09:13PM -0400, Stephen Frost wrote: > The above article, at least, suggested encrypting the sector number > using the second key and then multiplying that times 2^(block number), > where those blocks were actually AES 128bit blocks. The article further > claims that this is what's used in things like Bitlocker, TrueCrypt, > VeraCrypt and OpenSSL. > > While the documentation isn't super clear, I'm taking that to mean that > when you actually use EVP_aes_128_xts() in OpenSSL, and you provide it > with a 256-bit key (twice the size of the AES key length function), and > you give it a 'tweak', that what you would actually be passing in would > be the "sector number" in the above method, or for us perhaps it would > be relfilenode+block number, or maybe just block number but it seems > like it'd be better to include the relfilenode to me. If you go in that direction, you should make sure pg_upgrade preserves what you use (it does not preserve relfilenode, just pg_class.oid), and CREATE DATABASE still works with a simple file copy. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Re: storing an explicit nonce
Greetings, * Bruce Momjian (br...@momjian.us) wrote: > On Thu, May 27, 2021 at 04:09:13PM -0400, Stephen Frost wrote: > > The above article, at least, suggested encrypting the sector number > > using the second key and then multiplying that times 2^(block number), > > where those blocks were actually AES 128bit blocks. The article further > > claims that this is what's used in things like Bitlocker, TrueCrypt, > > VeraCrypt and OpenSSL. > > > > While the documentation isn't super clear, I'm taking that to mean that > > when you actually use EVP_aes_128_xts() in OpenSSL, and you provide it > > with a 256-bit key (twice the size of the AES key length function), and > > you give it a 'tweak', that what you would actually be passing in would > > be the "sector number" in the above method, or for us perhaps it would > > be relfilenode+block number, or maybe just block number but it seems > > like it'd be better to include the relfilenode to me. > > If you go in that direction, you should make sure pg_upgrade preserves > what you use (it does not preserve relfilenode, just pg_class.oid), and > CREATE DATABASE still works with a simple file copy. Ah, yes, good point, if we support in-place pg_upgrade of an encrypted cluster then the tweak has to be consistent between the old and new. I tend to agree with Andres that it'd be reasonable to make CREATE DATABASE do a bit more work for an encrypted cluster though, so I'm less concerned about that. Using pg_class.oid instead of relfilenode seems likely to complicate things like crash recovery though, wouldn't it? I wonder if there's something else we could use. Thanks, Stephen
Re: Move pg_attribute.attcompression to earlier in struct for reduced size?
On Tue, May 25, 2021 at 08:33:47PM -0500, Justin Pryzby wrote: > On Sun, May 23, 2021 at 12:25:10PM -0400, Tom Lane wrote: > > However, the more I looked at that code the less I liked it. > > I think the way that compression selection is handled for indexes, > > ie consult default_toast_compression on-the-fly, is *far* saner > > than what is currently implemented for tables. So I think we > > should redefine attcompression as "ID of a compression method > > to use, or \0 to select the prevailing default. Ignored if > > attstorage does not permit the use of compression". > > +1 > > It reminds me of reltablespace, which is stored as 0 to mean the database's > default tablespace. I was surprised to realize that I made this same suggestion last month... https://www.postgresql.org/message-id/20210320074420.gr11...@telsasoft.com |..unless we changed attcompression='\0' to mean (for varlena) "the default |compression". Rather than "resolving" to the default compression at the time |the table is created, columns without an explicit compression set would "defer" |to the GUC (of course, that only affects newly-inserted data). The original reason for that suggestion Michael handled differently in 63db0ac3f9e6bae313da67f640c95c0045b7f0ee -- Justin
Re: storing an explicit nonce
Hi, On 2021-05-27 16:55:29 -0400, Robert Haas wrote: > No. You're confusing what I was saying here, in the contents of your > comments about the limitations of AES-GCM-SIV, with the discussion > with Bruce about nonce generation. Ah. I think the focus on LSNs confused me a bit. FWIW: Nist guidance on IVs for AES GCM (surprisingly readable): https://nvlpubs.nist.gov/nistpubs/Legacy/SP/nistspecialpublication800-38d.pdf AES-GCM-SIV (harder to read): https://eprint.iacr.org/2017/168.pdf Greetings, Andres Freund
Re: Move pg_attribute.attcompression to earlier in struct for reduced size?
On Thu, May 27, 2021 at 1:29 PM Tom Lane wrote: > Yeah. My belief here is that users might bother to change > default_toast_compression, or that we might do it for them in a few > years, but the gains from doing so are going to be only incremental. > That being the case, most DBAs will be content to allow the older > compression method to age out of their databases through routine row > updates. The idea that somebody is going to be excited enough about > this to run a downtime-inducing VACUUM FULL doesn't really pass the > smell test. That was my original understanding of your position, FWIW. I agree with all of this. > That doesn't make LZ4 compression useless, by any means, but it does > suggest that we shouldn't be adding overhead to VACUUM FULL for the > purpose of easing instantaneous switchovers. Right. More generally, there often seems to be value in under-specifying what a compression option does. Or in treating them as advisory. You mentioned the history of SET STORAGE, which seems very relevant. I am reminded of the example of B-Tree deduplication with unique indexes, where we selectively apply the optimization based on page-level details. Deduplication isn't usually useful in unique indexes (for the obvious reason), though occasionally it is extremely useful. I think that there might be a variety of things that work a little like that. It can help with avoiding unnecessary dump and reload hazards, too. I am interested in hearing the *principle* behind Robert's position. This whole area seems like something that might have at least a couple of different schools of thought. If it is then I'd sincerely like to hear the other side of the argument. -- Peter Geoghegan
Re: Replacing pg_depend PIN entries with a fixed range check
I wrote: > Robert Haas writes: >> The first hunk of the patch seems to back away from the idea that the >> cutoff is 13000, but the second half of the patch says 13000 still >> matters. Not sure I understand what's going on there exactly. > Not sure exactly what you're looking at, but IIRC there is a place > where the patch is cleaning up after ab596105b's failure to adjust > bki.sgml to match its change of FirstBootstrapObjectId from 12000 > to 13000. I hadn't bothered to fix that separately, but I guess > we should do so, else v14 is going to ship with incorrect docs. I take that back: I had committed that doc fix, in 1f9b0e693, so I'm still unsure what was confusing you. (But a4390abec just reverted it, anyway.) Attached is a rebase over a4390abec. The decision in that commit to not expect global uniqueness of OIDs above 10K frees us to use a much simpler solution than before: we can just go ahead and start the backend's OID counter at 1, and not worry about conflicts, because the OID generation logic can deal with any conflicts just fine as long as you're okay with only having per-catalog uniqueness. So this gets rid of the set_next_oid mechanism that I'd invented in v2, and yet there's still no notable risk of running out of OIDs in the 10K-12K range. While testing this, I discovered something that I either never knew or had forgotten: the bootstrap backend is itself assigning some OIDs, specifically OIDs for the composite types associated with most of the system catalogs (plus their array types). I find this scary, because it is happening before we've built the catalog indexes, so it's impossible to ensure uniqueness. (Of course, when we do build the indexes, we'd notice any conflicts; but that's not a solution.) I think it accidentally works because we don't ask genbki.pl to assign any pg_type OIDs, but that seems fragile. Seems like maybe we should fix genbki.pl to assign those OIDs, and then change GetNewOidWithIndex to error out in bootstrap mode. 
However that's a pre-existing issue, so I don't feel that this patch needs to be the one to fix it.

regards, tom lane

diff --git a/doc/src/sgml/bki.sgml b/doc/src/sgml/bki.sgml
index db1b3d5e9a..f32208b550 100644
--- a/doc/src/sgml/bki.sgml
+++ b/doc/src/sgml/bki.sgml
@@ -418,19 +418,33 @@
 If genbki.pl needs to assign an OID to a catalog entry
 that does not have a manually-assigned OID, it will use a value in
-the range 1—11999. The server's OID counter is set to 12000
-at the start of a bootstrap run. Thus objects created by regular SQL
-commands during the later phases of bootstrap, such as objects created
-while running the information_schema.sql script,
-receive OIDs of 12000 or above.
+the range 1—11999. The server's OID counter is set to 1
+at the start of a bootstrap run, so that any objects created on-the-fly
+during bootstrap processing also receive OIDs in this range. (The
+usual OID assignment mechanism takes care of preventing any conflicts.)
+
+
+
+Objects with OIDs below FirstUnpinnedObjectId (12000)
+are considered pinned, preventing them from being
+deleted. (There are a small number of exceptions, which are
+hard-wired into IsPinnedObject().)
+initdb forces the OID counter up
+to FirstUnpinnedObjectId as soon as it's ready to
+create unpinned objects. Thus objects created during the later phases
+of initdb, such as objects created while
+running the information_schema.sql script, will
+not be pinned, while all objects known
+to genbki.pl will be.

 OIDs assigned during normal database operation are constrained to be
 16384 or higher. This ensures that the range 1—16383 is free for
 OIDs assigned automatically by genbki.pl or
-during bootstrap. These automatically-assigned OIDs are not considered
-stable, and may change from one installation to another.
+during initdb. These
+automatically-assigned OIDs are not considered stable, and may change
+from one installation to another.
diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 16493209c6..d4e1a13746 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -3264,8 +3264,7 @@ SCRAM-SHA-256$:&l
 (references pg_class.oid)
- The OID of the system catalog the dependent object is in,
- or zero for a DEPENDENCY_PIN entry
+ The OID of the system catalog the dependent object is in
@@ -3275,8 +3274,7 @@ SCRAM-SHA-256$ :&l
 (references any OID column)
- The OID of the specific dependent object,
- or zero for a DEPENDENCY_PIN entry
+ The OID of the specific dependent object
@@ -3467,19 +3465,6 @@ SCRAM-SHA-256$ :&l
-
-
- DEPENDENCY_PIN (p)
-
-
Re: [HACKERS] EvalPlanQual behaves oddly for FDW queries involving system columns
On 2019-Mar-15, Etsuro Fujita wrote: > (2019/03/04 12:10), Etsuro Fujita wrote: > > (2019/03/02 3:57), Andres Freund wrote: > > > FWIW, I pushed the EPQ patch, doing this conversion blindly. It'd be > > > awesome if you'd check that it actually works... > > > > I'll start the work later this week. I think I can post an (initial) > > report on that next week, maybe. > > Here is an updated version of the patch [1]. What happened to this patch? Seems it was forgotten ... -- Álvaro Herrera 39°49'30"S 73°17'W
Re: Teaching users how they can get the most out of HOT in Postgres 14
On Sun, May 23, 2021 at 11:34 PM Masahiko Sawada wrote: > I think the possible side effect of this hard-coded > BYPASS_THRESHOLD_PAGES would be that by default, bulkdelete is not > called for a long term and the index becomes bloat. What do you think of the approach taken in the attached POC patch? The patch makes it possible to disable the optimization by generalizing the INDEX_CLEANUP reloption to be an enum that looks like a ternary boolean (not just a plain boolean). INDEX_CLEANUP now accepts the values 'auto', 'on', and 'off' (plus a variety of alternative spellings, the usual ones for booleans in Postgres). Now 'auto' is the default, and 'on' forces the previous behavior inside vacuumlazy.c. It does not disable the failsafe, though -- INDEX_CLEANUP remains a fairly mechanical thing. This approach seems good to me because INDEX_CLEANUP remains consistent with the original purpose and design of INDEX_CLEANUP -- that was always an option that forced VACUUM to do something special with indexes. I don't see much downside to this approach, either. As things stand, INDEX_CLEANUP is mostly superseded by the failsafe, so we don't really need to talk about wraparound emergencies in the docs for INDEX_CLEANUP anymore. This seems much more elegant than either repurposing/reviving cleanup_index_scale_factor (which makes no sense to me at all) or inventing a new reloption (which would itself be in tension with INDEX_CLEANUP). There are some practical issues that make this patch surprisingly complicated for such a simple problem. For example, I hope that I haven't missed any subtlety in generalizing a boolean reloption like this. We've done similar things with GUCs in the past, but this may be a little different. Another concern with this approach is what it means for the VACUUM command itself. I haven't added an 'auto' spelling that is accepted by the VACUUM command in this POC version. But do I need to at all? 
Can that just be implied by not having any INDEX_CLEANUP option? And does StdRdOptions.vacuum_truncate now need to become a VacOptTernaryValue field too, for consistency with the new definition of StdRdOptions.vacuum_index_cleanup? -- Peter Geoghegan 0001-Generalize-VACUUM-s-INDEX_CLEANUP-option.patch
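The "ternary boolean" shape of the generalized reloption described above can be modeled roughly as follows. This is a Python sketch of the semantics only; the real implementation is C inside PostgreSQL's reloptions machinery, and everything here except the 'auto'/'on'/'off' values and the usual boolean spellings mentioned in the mail is invented for illustration.

```python
from enum import Enum


class VacOptTernaryValue(Enum):
    """Ternary setting: 'auto' defers the decision to VACUUM itself."""
    DEFAULT = "auto"
    DISABLED = "off"
    ENABLED = "on"


# The usual boolean spellings accepted for reloptions, plus 'auto'.
_SPELLINGS = {
    "auto": VacOptTernaryValue.DEFAULT,
    "on": VacOptTernaryValue.ENABLED, "true": VacOptTernaryValue.ENABLED,
    "yes": VacOptTernaryValue.ENABLED, "1": VacOptTernaryValue.ENABLED,
    "off": VacOptTernaryValue.DISABLED, "false": VacOptTernaryValue.DISABLED,
    "no": VacOptTernaryValue.DISABLED, "0": VacOptTernaryValue.DISABLED,
}


def parse_index_cleanup(value: str) -> VacOptTernaryValue:
    """Parse an INDEX_CLEANUP reloption value, case-insensitively."""
    try:
        return _SPELLINGS[value.strip().lower()]
    except KeyError:
        raise ValueError(f"invalid value for INDEX_CLEANUP: {value!r}") from None
```

The key property is that an unset or 'auto' value is distinguishable from an explicit 'on', so 'on' can force index vacuuming while 'auto' lets vacuumlazy.c apply the bypass heuristic.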
RE: Parallel Inserts in CREATE TABLE AS
From: Bharath Rupireddy > I'm still not sure why the execution time with 0 workers (or serial execution > or > no parallelism involved) on my testing system is 112 sec compared to 58 sec on > Hou-San's system for the same use case. Maybe the testing system I'm using > is not of the latest configuration compared to others. What's the setting of wal_level on your two systems? I thought it could be that you set it to something higher than minimal, while Hou-san set it to minimal. (I forgot the results of 2 and 4 workers, though.) Regards Takayuki Tsunakawa