Possible optimization of heap_lock_tuple()
It seems that a concurrent UPDATE can restart heap_lock_tuple() even if it's not necessary. Is the attached proposal correct and worth applying?

--
Antonin Houska
Web: https://www.cybertec-postgresql.com

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 6ac07f2fda..c3310ab4f9 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -4553,7 +4553,7 @@ l3:
 		 */
 		if (!HeapTupleHeaderIsOnlyLocked(tuple->t_data) &&
 			((tuple->t_data->t_infomask2 & HEAP_KEYS_UPDATED) ||
-			 !updated))
+			 (!updated && follow_updates)))
 			goto l3;
 
 		/* Things look okay, so we can skip sleeping */
Re: Batch insert in CTAS/MatView code
On Wed, May 26, 2021 at 12:18 PM tsunakawa.ta...@fujitsu.com wrote: > > Hello Paul-san, > > From: Daniel Gustafsson > > In an off-list discussion with Paul, we decided to withdraw this patch for > > now > > and instead create a new entry when there is a re-worked patch. This has > > now > > been done in the CF app. > > Would you mind if I take over this patch for PG 15? I find this promising, > as Bharath-san demonstrated good performance by combining your patch and the > parallel CTAS that Bharath-san has been working on. We'd like to do things > that enhance parallelism. > > Please allow me to start posting the revised patches next week if I can't get > your reply. Hi, I think the "New Table Access Methods for Multi and Single Inserts" patches at [1] make multi insert usage easy for COPY, CTAS/Create Mat View, Refresh Mat View and so on. It also has a patch for multi inserts in CTAS and Refresh Mat View (v6-0002-CTAS-and-REFRESH-Mat-View-With-New-Multi-Insert-T.patch). Please see that thread and feel free to review it. [1] - https://www.postgresql.org/message-id/CALj2ACXdrOmB6Na9amHWZHKvRT3Z0nwTRsCwoMT-npOBtmXLXg%40mail.gmail.com With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
RE: Batch insert in CTAS/MatView code
From: Bharath Rupireddy > I think the "New Table Access Methods for Multi and Single Inserts" > patches at [1] make multi insert usage easy for COPY, CTAS/Create Mat > View, Refresh Mat View and so on. It also has a patch for multi > inserts in CTAS and Refresh Mat View > (v6-0002-CTAS-and-REFRESH-Mat-View-With-New-Multi-Insert-T.patch). > Please see that thread and feel free to review it. Ouch, I didn't notice that the patch was reborn in that thread. OK. Could you add it to the CF if you haven't yet? Regards Takayuki Tsunakawa
Re: Batch insert in CTAS/MatView code
On Wed, May 26, 2021 at 12:49 PM tsunakawa.ta...@fujitsu.com wrote: > > From: Bharath Rupireddy > > I think the "New Table Access Methods for Multi and Single Inserts" > > patches at [1] make multi insert usage easy for COPY, CTAS/Create Mat > > View, Refresh Mat View and so on. It also has a patch for multi > > inserts in CTAS and Refresh Mat View > > (v6-0002-CTAS-and-REFRESH-Mat-View-With-New-Multi-Insert-T.patch). > > Please see that thread and feel free to review it. > > Ouch, I didn't notice that the patch was reborn in that thread. OK. > > Could you add it to the CF if you haven't yet? It already has one - https://commitfest.postgresql.org/33/2871/ With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
Fix typo: multiple tuple => tuples
Hi,

I found a possible typo in the code comments of heap_multi_insert.

- * heap_multi_insert - insert multiple tuple into a heap
+ * heap_multi_insert - insert multiple tuples into a heap

Attaching a patch to fix it.

Best regards,
houzj

0001-fix-typo.patch
Description: 0001-fix-typo.patch
Re: Incorrect snapshots while promoting hot standby node when 2PC is used
On Thu, Apr 22, 2021 at 01:36:03PM -0700, Andres Freund wrote:
> The sequence in StartupXLOG() leading to the issue is the following:
>
> 1) RecoverPreparedTransactions();
> 2) ShutdownRecoveryTransactionEnvironment();
> 3) XLogCtl->SharedRecoveryState = RECOVERY_STATE_DONE;
>
> Because 2) resets the KnownAssignedXids machinery, snapshots that happen
> between 2) and 3) will not actually look at the procarray to compute
> snapshots, as that only happens within
>
> snapshot->takenDuringRecovery = RecoveryInProgress();
> if (!snapshot->takenDuringRecovery)
>
> and RecoveryInProgress() checks XLogCtl->SharedRecoveryState !=
> RECOVERY_STATE_DONE, which is set in 3).

Oh, indeed. It is easy to see RecentXmin jumping back and forth while running 023_pitr_prepared_xact.pl with a small sleep added just after ShutdownRecoveryTransactionEnvironment().

> Without prepared transaction there probably isn't an issue, because there
> shouldn't be any other in-progress xids at that point?

Yes, there should not be any as far as I recall. 2PC is kind of special with its fake ProcArray entries.

> I think to fix the issue we'd have to move
> ShutdownRecoveryTransactionEnvironment() to after XLogCtl->SharedRecoveryState
> = RECOVERY_STATE_DONE.
>
> The acquisition of ProcArrayLock() in
> ShutdownRecoveryTransactionEnvironment()->ExpireAllKnownAssignedTransactionIds()
> should prevent the data from being removed between the RecoveryInProgress()
> and the KnownAssignedXidsGetAndSetXmin() calls in GetSnapshotData().
>
> I haven't yet figured out whether there would be a problem with deferring the
> other tasks in ShutdownRecoveryTransactionEnvironment() until after
> RECOVERY_STATE_DONE.

Hmm. This would mean releasing all the exclusive locks tracked by a standby, as of StandbyReleaseAllLocks(), after opening the instance for writes after a promotion. I don't think that's unsafe, but it would be intrusive. Anyway, isn't the issue ExpireAllKnownAssignedTransactionIds() itself, where we should try to not wipe out the 2PC entries to make sure that all those snapshots still see the 2PC transactions as something to count on? I am attaching a crude patch to show the idea.

Thoughts?
--
Michael

diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 42a89fc5dc..2e819163c4 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -4853,10 +4853,24 @@ KnownAssignedXidsRemovePreceding(TransactionId removeXid)
 
 	if (!TransactionIdIsValid(removeXid))
 	{
-		elog(trace_recovery(DEBUG4), "removing all KnownAssignedXids");
-		pArray->numKnownAssignedXids = 0;
-		pArray->headKnownAssignedXids = pArray->tailKnownAssignedXids = 0;
-		return;
+		tail = pArray->tailKnownAssignedXids;
+		head = pArray->headKnownAssignedXids;
+
+		for (i = tail; i < head; i++)
+		{
+			if (KnownAssignedXidsValid[i])
+			{
+				TransactionId knownXid = KnownAssignedXids[i];
+
+				if (!StandbyTransactionIdIsPrepared(knownXid))
+				{
+					KnownAssignedXidsValid[i] = false;
+					count++;
+				}
+			}
+		}
+
+		goto recompute;
 	}
 
 	elog(trace_recovery(DEBUG4), "prune KnownAssignedXids to %u", removeXid);
@@ -4885,6 +4899,9 @@ KnownAssignedXidsRemovePreceding(TransactionId removeXid)
 		}
 	}
 
+	/* fall through */
+recompute:
+
 	pArray->numKnownAssignedXids -= count;
 	Assert(pArray->numKnownAssignedXids >= 0);

signature.asc
Description: PGP signature
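For readers skimming the thread, the reordering Andres suggests upthread would look roughly like the following simplified sketch (an illustration only, not the actual StartupXLOG() code; taking info_lck around the state change is an assumption of this sketch):

	RecoverPreparedTransactions();

	/* 3) now happens before 2): mark recovery as finished first ... */
	SpinLockAcquire(&XLogCtl->info_lck);
	XLogCtl->SharedRecoveryState = RECOVERY_STATE_DONE;
	SpinLockRelease(&XLogCtl->info_lck);

	/*
	 * ... so that snapshots taken from here on use the procarray (which
	 * already has the fake 2PC entries) instead of the KnownAssignedXids
	 * machinery that the next call tears down.
	 */
	ShutdownRecoveryTransactionEnvironment();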
Re: [BUG] Autovacuum not dynamically decreasing cost_limit and cost_delay
On Wed, Apr 14, 2021 at 11:17 PM Mead, Scott wrote: > > > > > On Mar 1, 2021, at 8:43 PM, Masahiko Sawada wrote: > > > > CAUTION: This email originated from outside of the organization. Do not > > click links or open attachments unless you can confirm the sender and know > > the content is safe. > > > > > > > > On Mon, Feb 8, 2021 at 11:49 PM Mead, Scott wrote: > >> > >> Hello, > >> I recently looked at what it would take to make a running autovacuum > >> pick-up a change to either cost_delay or cost_limit. Users frequently > >> will have a conservative value set, and then wish to change it when > >> autovacuum initiates a freeze on a relation. Most users end up finding > >> out they are in ‘to prevent wraparound’ after it has happened, this means > >> that if they want the vacuum to take advantage of more I/O, they need to > >> stop and then restart the currently running vacuum (after reloading the > >> GUCs). > >> > >> Initially, my goal was to determine feasibility for making this dynamic. > >> I added debug code to vacuum.c:vacuum_delay_point(void) and found that > >> changes to cost_delay and cost_limit are already processed by a running > >> vacuum. There was a bug preventing the cost_delay or cost_limit from > >> being configured to allow higher throughput however. > >> > >> I believe this is a bug because currently, autovacuum will dynamically > >> detect and increase the cost_limit or cost_delay, but it can never > >> decrease those values beyond their setting when the vacuum began. The > >> current behavior is for vacuum to limit the maximum throughput of > >> currently running vacuum processes to the cost_limit that was set when the > >> vacuum process began. > > > > Thanks for your report. > > > > I've not looked at the patch yet but I agree that the calculation for > > autovacuum cost delay seems not to work fine if vacuum-delay-related > > parameters (e.g., autovacuum_vacuum_cost_delay etc) are changed during > > vacuuming a table to speed up running autovacuums. Here is my > > analysis: > > > I appreciate your in-depth analysis and will comment in-line. That said, I > still think it’s important that the attached path is applied. As it is > today, a simple few lines of code prevent users from being able to increase > the throughput on vacuums that are running without having to cancel them > first. > > The patch that I’ve provided allows users to decrease their vacuum_cost_delay > and get an immediate boost in performance to their running vacuum jobs. > > > > > > Suppose we have the following parameters and 3 autovacuum workers are > > running on different tables: > > > > autovacuum_vacuum_cost_delay = 100 > > autovacuum_vacuum_cost_limit = 100 > > > > Vacuum cost-based delay parameters for each workers are follows: > > > > worker->wi_cost_limit_base = 100 > > worker->wi_cost_limit = 66 > > worker->wi_cost_delay = 100 Sorry, worker->wi_cost_limit should be 33. > > > > Each running autovacuum has "wi_cost_limit = 66" because the total > > limit (100) is equally rationed. And another point is that the total > > wi_cost_limit (198 = 66*3) is less than autovacuum_vacuum_cost_limit, > > 100. Which are fine. So the total wi_cost_limit, 99, is less than autovacuum_vacuum_cost_limit, 100. > > > > Here let's change autovacuum_vacuum_cost_delay/limit value to speed up > > running autovacuums. > > > > Case 1 : increasing autovacuum_vacuum_cost_limit to 1000. 
> > > > After reloading the configuration file, vacuum cost-based delay > > parameters for each worker become as follows: > > > > worker->wi_cost_limit_base = 100 > > worker->wi_cost_limit = 100 > > worker->wi_cost_delay = 100 > > > > If we rationed autovacuum_vacuum_cost_limit, 1000, to 3 workers, it > > would be 333. But since we cap it by wi_cost_limit_base, the > > wi_cost_limit is 100. I think this is what Mead reported here. > > > Yes, this is exactly correct. The cost_limit is capped at the cost_limit > that was set during the start of a running vacuum. My patch changes this cap > to be the max allowed cost_limit (10,000). The comment of worker's limit calculation says: /* * We put a lower bound of 1 on the cost_limit, to avoid division- * by-zero in the vacuum code. Also, in case of roundoff trouble * in these calculations, let's be sure we don't ever set * cost_limit to more than the base value. */ worker->wi_cost_limit = Max(Min(limit, worker->wi_cost_limit_base), 1); If we use the max cost_limit as the upper bound here, the worker's limit could unnecessarily be higher than the base value in case of roundoff trouble? I think that the problem here is rather that we don't update wi_cost_limit_base and wi_cost_delay when rebalancing the cost. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
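To make the capping arithmetic discussed above concrete, here is a minimal standalone sketch (illustration only; the equal-share rationing below is a simplification of what autovac_balance_cost() actually does, and the variable names are assumed):

#include <stdio.h>

#define Max(a, b) ((a) > (b) ? (a) : (b))
#define Min(a, b) ((a) < (b) ? (a) : (b))

int
main(void)
{
	int		nworkers = 3;
	int		cost_limit_base = 100;	/* autovacuum_vacuum_cost_limit when the workers started */
	int		new_limit = 1000;		/* value after the reload in "Case 1" */
	int		rationed = new_limit / nworkers;	/* ~333 per worker */

	/* the same capping expression as the snippet quoted above */
	int		wi_cost_limit = Max(Min(rationed, cost_limit_base), 1);

	/* prints: rationed = 333, wi_cost_limit = 100 */
	printf("rationed = %d, wi_cost_limit = %d\n", rationed, wi_cost_limit);
	return 0;
}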
Re: Assertion failure while streaming toasted data
On Wed, May 26, 2021 at 11:53 AM Dilip Kumar wrote: > > On Wed, May 26, 2021 at 11:19 AM Amit Kapila wrote: > > > > > I searched and didn't find any similar existing tests. Can we think of > > any other way to test this code path? We already have one copy test in > > toast.sql, isn't it possible to write a similar test here? > > I will check that and let you know. I have followed this approach to write the test. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com From af116766c8562c36eb00c836505fe4725d19f81d Mon Sep 17 00:00:00 2001 From: Dilip Kumar Date: Tue, 25 May 2021 14:51:45 +0530 Subject: [PATCH v3] Fix bug while streaming the multi-insert toast changes While processing the multi-insert we can not clean the toast untill we get the last insert of the multi-insert. So mark the transaction incomplete for streaming if we get any multi-insert change, and keep it set until we get the last tuple of the multi-insert. --- contrib/test_decoding/expected/stream.out | 24 + contrib/test_decoding/sql/stream.sql| 18 + src/backend/replication/logical/reorderbuffer.c | 36 +++-- src/include/replication/reorderbuffer.h | 27 +-- 4 files changed, 64 insertions(+), 41 deletions(-) diff --git a/contrib/test_decoding/expected/stream.out b/contrib/test_decoding/expected/stream.out index e1c3bc8..0f21dcb 100644 --- a/contrib/test_decoding/expected/stream.out +++ b/contrib/test_decoding/expected/stream.out @@ -82,6 +82,30 @@ SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'incl committing streamed transaction (13 rows) +-- streaming test for toast with multi-insert +\COPY stream_test FROM STDIN +SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1'); + data +-- + opening a streamed block for transaction + streaming change for transaction + streaming change for transaction + streaming change for transaction + streaming change for transaction + streaming change for transaction + streaming change for transaction + streaming change for transaction + streaming change for transaction + streaming change for transaction + streaming change for transaction + streaming change for transaction + closing a streamed block for transaction + opening a streamed block for transaction + streaming change for transaction + closing a streamed block for transaction + committing streamed transaction +(17 rows) + DROP TABLE stream_test; SELECT pg_drop_replication_slot('regression_slot'); pg_drop_replication_slot diff --git a/contrib/test_decoding/sql/stream.sql b/contrib/test_decoding/sql/stream.sql index ce86c81..4feec62 100644 --- a/contrib/test_decoding/sql/stream.sql +++ b/contrib/test_decoding/sql/stream.sql @@ -26,5 +26,23 @@ SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'inc INSERT INTO stream_test SELECT repeat('a', 6000) || g.i FROM generate_series(1, 10) g(i); SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1'); +-- streaming test for toast with multi-insert +\COPY stream_test FROM STDIN 
+toasted-1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456
Re: locking [user] catalog tables vs 2pc vs logical rep
On 26 May 2021, at 05:04, Amit Kapila wrote: > > On Tue, May 25, 2021 at 12:40 PM Michael Paquier wrote: >> >> On Mon, May 24, 2021 at 10:03:01AM +0530, Amit Kapila wrote: >>> So, this appears to be an existing caveat of synchronous replication. >>> If that is the case, I am not sure if it is a good idea to just block >>> such ops for the prepared transaction. Also, what about other >>> operations which acquire an exclusive lock on [user]_catalog_tables >>> like: >>> cluster pg_trigger using pg_class_oid_index, similarly cluster on any >>> user_catalog_table, then the other problematic operation could >>> truncate of user_catalog_table as is discussed in another thread [1]. >>> I think all such operations can block even with synchronous >>> replication. I am not sure if we can create examples for all cases >>> because for ex. we don't have use of user_catalog_tables in-core but >>> maybe for others, we can try to create examples and see what happens? >>> >>> If all such operations can block for synchronous replication and >>> prepared transactions replication then we might want to document them >>> as caveats at page: >>> https://www.postgresql.org/docs/devel/logicaldecoding-synchronous.html >>> and then also give the reference for these caveats at prepared >>> transactions >>> page:https://www.postgresql.org/docs/devel/logicaldecoding-two-phase-commits.html >>> >>> What do you think? >> >> It seems to me that the 2PC issues on catalog tables and the issues >> related to logical replication in synchonous mode are two distinct >> things that need to be fixed separately. >> > > Fair enough. But the way we were looking at them as they will also > block (lead to deadlock) for logical replication of prepared > transactions and also logical replication in synchonous mode without > prepared transactions. Now, if we want to deal with the 2PC issues > separately that should be fine as well. However, for that we need to > see which all operations we want to block on [user]_catalog_tables. > The first one is lock command, then there are other operations like > Cluster which take exclusive lock on system catalog tables and we > allow them to be part of prepared transactions (example Cluster > pg_trigger using pg_trigger_oid_index;), another kind of operation is > Truncate on user_catalog_tables. Now, some of these might not allow > connecting after restart so we might need to think whether we want to > prohibit all such operations or only some of them. > IIRC this was discussed the first time 2PC decoding was proposed and everybody seemed fine with the limitation so I'd vote for just documenting it, same way as the sync rep issue. If you'd prefer fixing it by blocking something, wouldn't it make more sense to simply not allow PREPARE if one of these operations was executed, similarly to what we do with temp table access? -- Petr
Re: [HACKERS] logical decoding of two-phase transactions
On Tue, May 25, 2021 at 8:54 AM Ajin Cherian wrote: > > On Fri, May 21, 2021 at 6:43 PM Peter Smith wrote: > > > Fixed in v77-0001, v77-0002 > > Attaching a new patch-set that rebases the patch, addresses review > comments from Peter as well as a test failure reported by Tang. I've > also added some new test case into patch-2 authored by Tang. Thanks for the updated patch, few comments: 1) Should "The end LSN of the prepare." be changed to "end LSN of the prepare transaction."? --- a/doc/src/sgml/protocol.sgml +++ b/doc/src/sgml/protocol.sgml @@ -7538,6 +7538,13 @@ are available since protocol version 3. Int64 +The end LSN of the prepare. + + + + +Int64 + 2) Should the ";" be "," here? +++ b/doc/src/sgml/catalogs.sgml @@ -7639,6 +7639,18 @@ SCRAM-SHA-256$:&l + subtwophasestate char + + + State code: + d = two_phase mode was not requested, so is disabled; + p = two_phase mode was requested, but is pending enablement; + e = two_phase mode was requested, and is enabled. + 3) Should end_lsn be commit_end_lsn? + prepare_data->commit_end_lsn = pq_getmsgint64(in); + if (prepare_data->commit_end_lsn == InvalidXLogRecPtr) elog(ERROR, "end_lsn is not set in commit prepared message"); + prepare_data->prepare_time = pq_getmsgint64(in); 4) This change is not required diff --git a/src/include/replication/pgoutput.h b/src/include/replication/pgoutput.h index 0dc460f..93c6731 100644 --- a/src/include/replication/pgoutput.h +++ b/src/include/replication/pgoutput.h @@ -29,5 +29,4 @@ typedef struct PGOutputData boolmessages; booltwo_phase; } PGOutputData; - #endif /* PGOUTPUT_H */ 5) Will the worker receive commit prepared/rollback prepared as we have skip logic to skip commit prepared / commit rollback in pgoutput_rollback_prepared_txn and pgoutput_commit_prepared_txn: +* It is possible that we haven't received the prepare because +* the transaction did not have any changes relevant to this +* subscription and was essentially an empty prepare. In which case, +* the walsender is optimized to drop the empty transaction and the +* accompanying prepare. Silently ignore if we don't find the prepared +* transaction. */ - replorigin_session_origin_lsn = prepare_data.end_lsn; - replorigin_session_origin_timestamp = prepare_data.commit_time; + if (LookupGXact(gid, prepare_data.prepare_end_lsn, + prepare_data.prepare_time)) + { 6) I'm not sure if we could add some tests for skip empty prepare transactions, if possible add few tests. 7) We could add some debug level log messages for the transaction that will be skipped. Regards, Vignesh
Re: locking [user] catalog tables vs 2pc vs logical rep
On Tue, May 25, 2021 at 12:40 PM Michael Paquier wrote: > > On Mon, May 24, 2021 at 10:03:01AM +0530, Amit Kapila wrote: > > So, this appears to be an existing caveat of synchronous replication. > > If that is the case, I am not sure if it is a good idea to just block > > such ops for the prepared transaction. Also, what about other > > operations which acquire an exclusive lock on [user]_catalog_tables > > like: > > cluster pg_trigger using pg_class_oid_index, similarly cluster on any > > user_catalog_table, then the other problematic operation could > > truncate of user_catalog_table as is discussed in another thread [1]. > > I think all such operations can block even with synchronous > > replication. I am not sure if we can create examples for all cases > > because for ex. we don't have use of user_catalog_tables in-core but > > maybe for others, we can try to create examples and see what happens? > > > > If all such operations can block for synchronous replication and > > prepared transactions replication then we might want to document them > > as caveats at page: > > https://www.postgresql.org/docs/devel/logicaldecoding-synchronous.html > > and then also give the reference for these caveats at prepared > > transactions > > page:https://www.postgresql.org/docs/devel/logicaldecoding-two-phase-commits.html > > > > What do you think? > > It seems to me that the 2PC issues on catalog tables and the issues > related to logical replication in synchonous mode are two distinct > things that need to be fixed separately. > > The issue with LOCK taken on a catalog while a PREPARE TRANSACTION > holds locks around is bad enough in itself as it could lock down a > user from a cluster as long as the PREPARE TRANSACTION is not removed > from WAL (say the relation is critical for the connection startup). > This could be really disruptive for the user even if he tried to take > a lock on an object he owns, and the way to recover is not easy here, > and the way to recover involves either an old backup or worse, > pg_resetwal. > > The second issue with logical replication is still disruptive, but it > looks to me more like a don't-do-it issue, and documenting the caveats > sounds fine enough. > > Looking at the patch from upthread.. > > + /* > +* Make note that we've locked a system table or an user catalog > +* table. This flag will be checked later during prepare > transaction > +* to fail the prepare transaction. > +*/ > + if (lockstmt->mode >= ExclusiveLock && > + (IsCatalogRelationOid(reloid) || > +RelationIsUsedAsCatalogTable(rel))) > + MyXactFlags |= XACT_FLAGS_ACQUIREDEXCLUSIVELOCK_SYSREL; > I think that I'd just use IsCatalogRelationOid() here, and I'd be more > severe and restrict all attempts for any lock levels. It seems to me > that this needs to happen within RangeVarCallbackForLockTable(). > I would also rename the flag as just XACT_FLAGS_LOCKEDCATALOG. > > +errmsg("cannot PREPARE a transaction that has an exclusive > lock on user catalog/system table(s)"))); > What about "cannot PREPARE a transaction that has locked a catalog > relation"? At this point it is not clear if we are planning to fix this issue by throwing an error or document it. I will fix these comments once we come to consensus. Regards, Vignesh
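For reference, the disruptive sequence Michael describes can be illustrated with something like the following (run as superuser; the GID is made up, and max_prepared_transactions must be greater than zero):

BEGIN;
LOCK TABLE pg_class IN ACCESS EXCLUSIVE MODE;
PREPARE TRANSACTION 'demo_catalog_lock';
-- The exclusive lock on pg_class stays with the prepared transaction, even
-- across a restart, so new sessions can block during startup until someone
-- runs COMMIT PREPARED or ROLLBACK PREPARED for 'demo_catalog_lock'.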
Re: Skipping logical replication transactions on subscriber side
On Tue, May 25, 2021 at 6:12 PM Masahiko Sawada wrote: > > On Tue, May 25, 2021 at 7:21 PM Bharath Rupireddy > wrote: > > > > If there's no way to get the "correct LSN", then why can't we just > > print that LSN in the error context and/or in the new statistics view > > for logical replication workers, so that any of the existing ways can > > be used to skip exactly one txn? > > I think specifying XID to the subscription is more understandable for users. > I agree with you that specifying XID could be easier and understandable for users. I was thinking and studying a bit about what other systems do in this regard. Why don't we try to provide conflict resolution methods for users? The idea could be that either the conflicts can be resolved automatically or manually. In the case of manual resolution, users can use the existing methods or the XID stuff you are proposing here and in case of automatic resolution, the in-built or corresponding user-defined functions will be invoked for conflict resolution. There are more details to figure out in the automatic resolution scheme but I see a lot of value in doing the same. -- With Regards, Amit Kapila.
Re: locking [user] catalog tables vs 2pc vs logical rep
On Wed, May 26, 2021 at 1:53 PM Petr Jelinek wrote: > > On 26 May 2021, at 05:04, Amit Kapila wrote: > > > > On Tue, May 25, 2021 at 12:40 PM Michael Paquier > > wrote: > >> > >> It seems to me that the 2PC issues on catalog tables and the issues > >> related to logical replication in synchonous mode are two distinct > >> things that need to be fixed separately. > >> > > > > Fair enough. But the way we were looking at them as they will also > > block (lead to deadlock) for logical replication of prepared > > transactions and also logical replication in synchonous mode without > > prepared transactions. Now, if we want to deal with the 2PC issues > > separately that should be fine as well. However, for that we need to > > see which all operations we want to block on [user]_catalog_tables. > > The first one is lock command, then there are other operations like > > Cluster which take exclusive lock on system catalog tables and we > > allow them to be part of prepared transactions (example Cluster > > pg_trigger using pg_trigger_oid_index;), another kind of operation is > > Truncate on user_catalog_tables. Now, some of these might not allow > > connecting after restart so we might need to think whether we want to > > prohibit all such operations or only some of them. > > > > > IIRC this was discussed the first time 2PC decoding was proposed and > everybody seemed fine with the limitation so I'd vote for just documenting > it, same way as the sync rep issue. > +1. > If you'd prefer fixing it by blocking something, wouldn't it make more sense > to simply not allow PREPARE if one of these operations was executed, > similarly to what we do with temp table access? > The point was that even if somehow we block for prepare, there doesn't seem to be a simple way for synchronous logical replication which can also have similar problems. So, I would prefer to document it and we can even think to backpatch the sync rep related documentation. Michael seems to be a bit interested in dealing with some of the 2PC issues due to reasons different than logical replication which I am not completely sure is a good idea and Tom also feels that is not a good idea. -- With Regards, Amit Kapila.
Re: Add ZSON extension to /contrib/
Hi hackers, Many thanks for your feedback, I very much appreciate it! > If the extension is mature enough, why make it an extension in > contrib, and not instead either enhance the existing jsonb type with > it or make it a built-in type? > IMO we have too d*mn many JSON types already. If we can find a way > to shoehorn this optimization into JSONB, that'd be great. Otherwise > I do not think it's worth the added user confusion. Magnus, Tom, My reasoning is that if the problem can be solved with an extension there is little reason to modify the core. This seems to be in the spirit of PostgreSQL. If the community reaches the consensus to modify the core to introduce a similar feature, we could discuss this as well. It sounds like a lot of unnecessary work to me though (see below). > * doesn't cover all cases, notably indexes. Tom, Not sure if I follow. What cases do you have in mind? > Do note that e.g. postgis is not in contrib, but is available in e.g. RDS. Matthias, Good point. I suspect that PostGIS is an exception though... > I like the idea of the ZSON type, but I'm somewhat disappointed by its > current limitations Several people suggested various enhancements right after learning about ZSON. Time showed, however, that none of the real-world users really need e.g. more than one common dictionary per database. I suspect this is because no one has more than 2**16 repeatable unique strings (one dictionary limitation) in their documents. Thus there is no benefit in having separate dictionaries and corresponding extra complexity. > - Each dictionary uses a lot of memory, regardless of the number of > actual stored keys. For 32-bit systems the base usage of a dictionary > without entries ((sizeof(Word) + sizeof(uint16)) * 2**16) would be > almost 1MB, and for 64-bit it would be 1.7MB. That is significantly > more than I'd want to install. You are probably right on this one, this part could be optimized. I will address this if we agree on submitting the patch. > - You call gettimeofday() in both dict_get and in get_current_dict_id. > These functions can be called in short and tight loops (for small GSON > fields), in which case it would add significant overhead through the > implied syscalls. I must admit, I'm not an expert in this area. My understanding is that gettimeofday() is implemented as single virtual memory access on modern operating systems, e.g. VDSO on Linux, thus it's very cheap. I'm not that sure about other supported platforms though. Probably worth investigating. > It does mean that you're deTOASTing > the full GSON field, and that the stored bytestring will not be > structured / doesn't work well with current debuggers. Unfortunately, I'm not very well aware of debugging tools in this context. Could you please name the debuggers I should take into account? > We (2ndQuadrant, now part of EDB) made some enhancements to Zson a few years > ago, and I have permission to contribute those if this proposal is adopted. Andrew, That's great, and personally I very much like the enhancements you've made. Purely out of curiosity, did they ended up as a part of 2ndQiadrant / EDB products? I will be happy to accept a pull request with these enhancements regardless of how the story with this proposal ends up. > Quite so. To some extent it's a toy. But at least one of our customers > has found it useful, and judging by Aleksander's email they aren't > alone. Indeed, this is an extremely simple extension, ~500 effective lines of code in C. 
It addresses a somewhat specific scenario, which, to my regret, doesn't seem to be uncommon. It is a pain-killer of a sort. In an ideal world, people would simply normalize their data. -- Best regards, Aleksander Alekseev
Re: Fix typo: multiple tuple => tuples
On Wed, May 26, 2021 at 07:37:15AM +, houzj.f...@fujitsu.com wrote: > I found a possible typo in the code comments of heap_multi_insert. > > - * heap_multi_insert - insert multiple tuple into a heap > + * heap_multi_insert - insert multiple tuples into a heap > > Attaching a patch to fix it. Thanks, fixed. -- Michael signature.asc Description: PGP signature
Re: Parallel Inserts in CREATE TABLE AS
On Tue, May 25, 2021 at 12:05 PM houzj.f...@fujitsu.com wrote: > > I noticed one place which could be one of the reasons that cause the > performance degradation. > > + /* > +* We don't need to skip contacting FSM while inserting > tuples for > +* parallel mode, while extending the relations, workers > instead of > +* blocking on a page while another worker is inserting, can > check the > +* FSM for another page that can accommodate the tuples. This > results > +* in major benefit for parallel inserts. > +*/ > + myState->ti_options = 0; > > I am not quite sure that disabling the " SKIP FSM " in parallel worker will > bring performance gain. > In my test environment, if I change this code to use option " > TABLE_INSERT_SKIP_FSM ", then there > seems no performance degradation . Could you please have a try on it ? > (I test with the SQL you provided earlier[1]) Thanks for trying that out. Please see the code around the use_fsm flag in RelationGetBufferForTuple for more understanding of the points below. What happens if FSM is skipped i.e. myState->ti_options = TABLE_INSERT_SKIP_FSM;? 1) The flag use_fsm will be false in heap_insert->RelationGetBufferForTuple. 2) Each worker initially gets a block and keeps inserting into it until it is full. When the block is full, the worker doesn't look in FSM GetPageWithFreeSpace as use_fsm is false. It directly goes for relation extension and tries to acquire relation extension lock with LockRelationForExtension. Note that the bulk extension of blocks with RelationAddExtraBlocks is not reached as use_fsm is false. 3) After acquiring the relation extension lock, it adds an extra new block with ReadBufferBI(relation, P_NEW, ...), see the comment "In addition to whatever extension we performed above, we always add at least one block to satisfy our own request." The tuple is inserted into this new block. Basically, the workers can't look for the empty pages from the pages added by other workers, they keep doing the above steps in silos. What happens if FSM is not skipped i.e. myState->ti_options = 0;? 1) The flag use_fsm will be true in heap_insert->RelationGetBufferForTuple. 2) Each worker initially gets a block and keeps inserting into it until it is full. When the block is full, the worker looks for the page with free space in FSM GetPageWithFreeSpace as use_fsm is true. If it can't find any page with the required amount of free space, it goes for bulk relation extension(RelationAddExtraBlocks) after acquiring relation extension lock with ConditionalLockRelationForExtension. Then the worker adds extraBlocks = Min(512, lockWaiters * 20); new blocks in RelationAddExtraBlocks and immediately updates the bottom level of FSM for each block (see the comment around RecordPageWithFreeSpace for why only the bottom level, not the entire FSM tree). After all the blocks are added, then it updates the entire FSM tree FreeSpaceMapVacuumRange. 4) After the bulk extension, then the worker adds another block see the comment "In addition to whatever extension we performed above, we always add at least one block to satisfy our own request." and inserts tuple into this new block. Basically, the workers can benefit from the bulk extension of the relation and they always can look for the empty pages from the pages added by other workers. There are high chances that the blocks will be available after bulk extension. Having said that, if the added extra blocks are consumed by the workers so fast i.e. 
if the tuple sizes are big i.e very less tuples per page, then, the bulk extension too can't help much and there will be more contention on the relation extension lock. Well, one might think to add more blocks at a time, say Min(1024, lockWaiters * 128/256/512) than currently extraBlocks = Min(512, lockWaiters * 20);. This will work (i.e. we don't see any regression with parallel inserts in CTAS patches), but it can't be a practical solution. Because the total pages for the relation will be more with many pages having more free space. Furthermore, the future sequential scans on that relation might take a lot of time. If myState->ti_options = TABLE_INSERT_SKIP_FSM; in only the place(within if (myState->is_parallel)), then it will be effective for leader i.e. leader will not look for FSM, but all the workers will, because within if (myState->is_parallel_worker) in intorel_startup, myState->ti_options = 0; for workers. I ran tests with configuration shown at [1] for the case 4 (2 bigint(of 8 bytes each) columns, 16 name(of 64 bytes each) columns, tuple size 1064 bytes, 10mn tuples) with leader participation where I'm seeing regression: 1) when myState->ti_options = TABLE_INSERT_SKIP_FSM; for both leader and workers, then my results are as follows: 0 workers - 116934.137, 2 workers - 209802.060, 4 workers - 248580.275 2) when myState->ti_options = 0; for both leade
Re: Parallel Inserts in CREATE TABLE AS
On Tue, May 25, 2021 at 1:10 PM tsunakawa.ta...@fujitsu.com wrote: > > Although this should be a controversial and may be crazy idea, the following > change brought 4-11% speedup. This is because I thought parallel workers > might contend for WAL flush as a result of them using the limited ring buffer > and flushing dirty buffers when the ring buffer is filled. Can we take > advantage of this? > > [GetBulkInsertState] > /* bistate->strategy = GetAccessStrategy(BAS_BULKWRITE);*/ > bistate->strategy = NULL; You are right. If the ring buffer (16MB) is not used and shared buffers (1GB) are used instead, then in your case, since the table size is 335MB and it fits in shared buffers, there will be little or no dirty-buffer flushing, so there will be some more speedup. Otherwise, a similar speedup can be observed when BAS_BULKWRITE is increased a bit from the current 16MB to some other reasonable value; I tried these experiments earlier. Alternatively, as I said in [1], we can also increase the number of extra blocks added at a time, say Min(1024, lockWaiters * 128/256/512) rather than the current extraBlocks = Min(512, lockWaiters * 20);. This will also give some speedup, and we don't see any regression with parallel inserts in the CTAS patches. But I'm not so sure that the hackers will accept any of the above as a practical solution to the "relation extension" problem. [1] https://www.postgresql.org/message-id/CALj2ACVdcrjwHXwvJqT-Fa32vnJEOjteep_3L24X8MK50E7M8w%40mail.gmail.com With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
Re: Parallel Inserts in CREATE TABLE AS
On Tue, May 25, 2021 at 1:50 PM tsunakawa.ta...@fujitsu.com wrote: > > From: houzj.f...@fujitsu.com > > + /* > > + * We don't need to skip contacting FSM while inserting tuples > > for > > + * parallel mode, while extending the relations, workers > > instead of > > + * blocking on a page while another worker is inserting, can > > check the > > + * FSM for another page that can accommodate the tuples. > > This results > > + * in major benefit for parallel inserts. > > + */ > > + myState->ti_options = 0; > > > > I am not quite sure that disabling the " SKIP FSM " in parallel worker will > > bring > > performance gain. > > In my test environment, if I change this code to use option " > > TABLE_INSERT_SKIP_FSM ", then there > > seems no performance degradation. > > +1, probably. I tried to explain it at [1]. Please have a look. > Does the code comment represent the situation like this? > > 1. Worker 1 is inserting into page 1. > > 2. Worker 2 tries to insert into page 1, but cannot acquire the buffer > content lock of page 1 because worker 1 holds it. > > 3. Worker 2 looks up FSM to find a page with enough free space. I tried to explain it at [1]. Please have a look. > But isn't FSM still empty during CTAS? No, FSM will be built on the fly in case if we don't skip the FSM i.e. myState->ti_options = 0, see RelationGetBufferForTuple with use_fsm = true -> GetPageWithFreeSpace -> fsm_search -> fsm_set_and_search -> fsm_readbuf with extend = true. [1] https://www.postgresql.org/message-id/CALj2ACVdcrjwHXwvJqT-Fa32vnJEOjteep_3L24X8MK50E7M8w%40mail.gmail.com With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
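As a quick way to verify that the FSM really is being built on the fly, and to see how much free space the workers leave behind, the pg_freespacemap extension can be used once the CTAS finishes (the table name below is just a placeholder for the test table):

CREATE EXTENSION IF NOT EXISTS pg_freespacemap;

-- Per-page free space recorded in the FSM of the freshly created table.
SELECT count(*) AS pages_tracked,
       round(avg(avail)) AS avg_free_bytes,
       max(avail) AS max_free_bytes
FROM pg_freespace('test_table');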
Re: Assertion failure while streaming toasted data
On Wed, May 26, 2021 at 1:47 PM Dilip Kumar wrote: > > On Wed, May 26, 2021 at 11:53 AM Dilip Kumar wrote: > > > > On Wed, May 26, 2021 at 11:19 AM Amit Kapila > > wrote: > > > > > > > > > I searched and didn't find any similar existing tests. Can we think of > > > any other way to test this code path? We already have one copy test in > > > toast.sql, isn't it possible to write a similar test here? > > > > I will check that and let you know. > > I have followed this approach to write the test. > The changes look good to me. I have made minor modifications in the comments, see attached. Let me know what do you think? -- With Regards, Amit Kapila. v4-0001-Fix-assertion-during-streaming-of-multi-insert-to.patch Description: Binary data
Re: Parallel Inserts in CREATE TABLE AS
On Fri, May 21, 2021 at 3:46 PM Amit Kapila wrote: > > On Fri, Mar 19, 2021 at 11:02 AM Bharath Rupireddy > wrote: > > > > On Wed, Jan 27, 2021 at 1:47 PM Bharath Rupireddy > > wrote: > > > > > > > I analyzed performance of parallel inserts in CTAS for different cases > > with tuple size 32bytes, 59bytes, 241bytes and 1064bytes. We could > > gain if the tuple sizes are lower. But if the tuple size is larger > > i..e 1064bytes, there's a regression with parallel inserts. Upon > > further analysis, it turned out that the parallel workers are > > requiring frequent extra blocks addition while concurrently extending > > the relation(in RelationAddExtraBlocks) and the majority of the time > > spent is going into flushing those new empty pages/blocks onto the > > disk. > > > > How you have ensured that the cost is due to the flushing of pages? I think I'm wrong to just say the problem is with the flushing of empty pages when bulk extending the relation. I should have said the problem is with the "relation extension lock", but I will hold on to it for a moment until I capture the relation extension lock wait events for the regression causing cases. I will share the information soon. > AFAICS, we don't flush the pages rather just write them and then > register those to be flushed by checkpointer, now it is possible that > the checkpointer sync queue gets full and the backend has to write by > itself but have we checked that? I think we can check via wait events, > if it is due to flush then we should see a lot of file sync > (WAIT_EVENT_DATA_FILE_SYNC) wait events. I will also capture the data file sync events along with relation extension lock wait events. > The other possibility could > be that the free pages added to FSM by one worker are not being used > by another worker due to some reason. Can we debug and check if the > pages added by one worker are being used by another worker? I tried to explain it at [1]. Please have a look. It looks like the burden is more on the "relation extension lock" and the way the extra new blocks are getting added. [1] https://www.postgresql.org/message-id/CALj2ACVdcrjwHXwvJqT-Fa32vnJEOjteep_3L24X8MK50E7M8w%40mail.gmail.com With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
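For what it's worth, one crude way to sample both of those wait events from another session while the parallel CTAS is running (wait event names as reported by pg_stat_activity):

SELECT pid, backend_type, wait_event_type, wait_event
FROM pg_stat_activity
WHERE wait_event IN ('extend', 'DataFileSync');
-- 'extend' (type 'Lock') is the relation extension lock;
-- 'DataFileSync' (type 'IO') corresponds to WAIT_EVENT_DATA_FILE_SYNC.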
Re: Parallel Inserts in CREATE TABLE AS
On Wed, May 26, 2021 at 5:28 PM Bharath Rupireddy wrote: > > On Fri, May 21, 2021 at 3:46 PM Amit Kapila wrote: > > > > On Fri, Mar 19, 2021 at 11:02 AM Bharath Rupireddy > > wrote: > > > > > > The other possibility could > > be that the free pages added to FSM by one worker are not being used > > by another worker due to some reason. Can we debug and check if the > > pages added by one worker are being used by another worker? > > I tried to explain it at [1]. Please have a look. > I have read it but I think we should try to ensure practically what is happening because it is possible that first time worker checked in FSM without taking relation extension lock, it didn't find any free page, and then when it tried to acquire the conditional lock, it got the same and just extended the relation by one block. So, in such a case it won't be able to use the newly added pages by another worker. I am not sure any such thing is happening here but I think it is better to verify it in some way. Also, I am not sure if just getting the info about the relation extension lock is sufficient? -- With Regards, Amit Kapila.
Re: Assertion failure while streaming toasted data
On Wed, May 26, 2021 at 5:28 PM Amit Kapila wrote: > > On Wed, May 26, 2021 at 1:47 PM Dilip Kumar wrote: > > > > On Wed, May 26, 2021 at 11:53 AM Dilip Kumar wrote: > > > > > > On Wed, May 26, 2021 at 11:19 AM Amit Kapila > > > wrote: > > > > > > > > > > > > > I searched and didn't find any similar existing tests. Can we think of > > > > any other way to test this code path? We already have one copy test in > > > > toast.sql, isn't it possible to write a similar test here? > > > > > > I will check that and let you know. > > > > I have followed this approach to write the test. > > > > The changes look good to me. I have made minor modifications in the > comments, see attached. Let me know what do you think? Your modification looks good to me. Thanks! -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: Fdw batch insert error out when set batch_size > 65535
On 5/26/21 8:57 AM, Bharath Rupireddy wrote: On Tue, May 25, 2021 at 2:47 PM Bharath Rupireddy wrote: On Tue, May 25, 2021 at 1:08 PM houzj.f...@fujitsu.com wrote: Thanks for the comments. I have addressed all comments to the v3 patch. Thanks! The patch basically looks good to me except that it is missing a commit message. I think it can be added now. With v3 patch, I observed failure in postgres_fdw test cases with insert query in prepared statements. Root cause is that in postgresGetForeignModifyBatchSize, fmstate can be null (see the existing code which handles for fmstate null cases). I fixed this, and added a commit message. PSA v4 patch. Thanks. In what situation is the fmstate NULL? If it is NULL, the current code simply skips the line adjusting it. Doesn't that mean we may not actually fix the bug in that case? Also, I think it'd be good to keep the existing comment, probably as the first line of the new comment block. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Fdw batch insert error out when set batch_size > 65535
On Wed, May 26, 2021 at 6:36 PM Tomas Vondra wrote: > > On 5/26/21 8:57 AM, Bharath Rupireddy wrote: > > On Tue, May 25, 2021 at 2:47 PM Bharath Rupireddy > > wrote: > >> > >> On Tue, May 25, 2021 at 1:08 PM houzj.f...@fujitsu.com > >> wrote: > >>> Thanks for the comments. I have addressed all comments to the v3 patch. > >> > >> Thanks! The patch basically looks good to me except that it is missing > >> a commit message. I think it can be added now. > > > > With v3 patch, I observed failure in postgres_fdw test cases with > > insert query in prepared statements. Root cause is that in > > postgresGetForeignModifyBatchSize, fmstate can be null (see the > > existing code which handles for fmstate null cases). I fixed this, and > > added a commit message. PSA v4 patch. > > > > Thanks. In what situation is the fmstate NULL? If it is NULL, the > current code simply skips the line adjusting it. Doesn't that mean we > may not actually fix the bug in that case? fmstate i.e. resultRelInfo->ri_FdwState is NULL for EXPLAIN without ANALYZE cases, below comment says it and we can't get the bug because we don't actually execute the insert statement. The bug occurs on the remote server when the insert query with those many query parameters is submitted to the remote server. I'm not sure if there are any other cases where it can be NULL. /* * In EXPLAIN without ANALYZE, ri_fdwstate is NULL, so we have to lookup * the option directly in server/table options. Otherwise just use the * value we determined earlier. */ if (fmstate) batch_size = fmstate->batch_size; else batch_size = get_batch_size_option(resultRelInfo->ri_RelationDesc); > Also, I think it'd be keep the existing comment, probably as the first > line of the new comment block. Do you mean to say we need to retain "/* Otherwise use the batch size specified for server/table. */"? If so, PSA v5. With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com v5-0001-Adjust-batch_size-to-not-exceed-max-param-limit-o.patch Description: Binary data
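To illustrate the NULL-fmstate path being discussed: a plain EXPLAIN (without ANALYZE) of an insert into a foreign table plans but never executes the insert, so ri_FdwState is not built and postgresGetForeignModifyBatchSize() has to fall back to the server/table option. The foreign table name here is made up:

EXPLAIN (VERBOSE, COSTS OFF)
INSERT INTO ft_batch_demo SELECT g, 'x' FROM generate_series(1, 10) g;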
Re: CREATE COLLATION - check for duplicate options and error out if found one
On Tue, Apr 27, 2021 at 3:21 PM Bharath Rupireddy wrote: > > Hi, > > While reviewing [1], I found that the CREATE COLLATION doesn't throw an error > if duplicate options are specified, see [2] for testing a few cases on HEAD. > This may end up accepting some of the weird cases, see [2]. It's against > other option checking code in the server where the duplicate option is > detected and an error thrown if found one. Attached a patch doing that. I > chose to have the error message "option \"%s\" specified more than once" and > parser_errposition because that's kind of agreed in [3]. > > Thoughts? > > [1] > https://www.postgresql.org/message-id/CALj2ACWVd%3D-E6uG5AdHD0MvHY6e4mVzkapT%3DvLDnJJseGjaJLQ%40mail.gmail.com > > [2] > CREATE COLLATION coll_dup_chk (LC_COLLATE = "POSIX", LC_COLLATE = "NONSENSE", > LC_CTYPE = "POSIX"); -- ERROR > CREATE COLLATION coll_dup_chk (LC_COLLATE = "NONSENSE", LC_COLLATE = "POSIX", > LC_CTYPE = "POSIX"); -- OK but it's weird > CREATE COLLATION coll_dup_chk (LC_CTYPE = "POSIX", LC_CTYPE = "NONSENSE", > LC_COLLATE = "POSIX"); -- ERROR > CREATE COLLATION coll_dup_chk (LC_CTYPE = "NONSENSE", LC_CTYPE = "POSIX", > LC_COLLATE = "POSIX",); -- OK but it's weird > CREATE COLLATION coll_dup_chk (PROVIDER = icu, PROVIDER = NONSENSE, > LC_COLLATE = "POSIX", LC_CTYPE = "POSIX"); -- ERROR > CREATE COLLATION coll_dup_chk (PROVIDER = NONSENSE, PROVIDER = icu, > LC_COLLATE = "POSIX", LC_CTYPE = "POSIX"); -- OK but it's weird > CREATE COLLATION case_sensitive (LOCALE = '', LOCALE = "NONSENSE"); -- ERROR > CREATE COLLATION coll_dup_chk (LOCALE = "NONSENSE", LOCALE = ''); -- OK but > it's weird > CREATE COLLATION coll_dup_chk (DETERMINISTIC = TRUE, DETERMINISTIC = > NONSENSE, LOCALE = ''); -- ERROR > CREATE COLLATION coll_dup_chk (DETERMINISTIC = NONSENSE, DETERMINISTIC = > TRUE, LOCALE = ''); -- OK but it's weird > > [3] > https://www.postgresql.org/message-id/CALj2ACUa%3DZM8QtOLPCHc7%3DWgFrx9P6-AgKQs8cmKLvNCvu7arQ%40mail.gmail.com +1 for fixing this issue, we have handled this error in other places. The patch does not apply on head, could you rebase the patch on head and post a new patch. Regards, Vignesh
Re: Logical Replication - improve error message while adding tables to the publication in check_publication_add_relation
On Mon, Apr 5, 2021 at 7:19 PM Bharath Rupireddy wrote: > > On Mon, Apr 5, 2021 at 6:41 PM Euler Taveira wrote: > > Here's the v4 patch reabsed on the latest master, please review it further. > > > > /* UNLOGGED and TEMP relations cannot be part of publication. */ > > if (!RelationIsPermanent(targetrel)) > > - ereport(ERROR, > > - (errcode(ERRCODE_INVALID_PARAMETER_VALUE), > > - errmsg("table \"%s\" cannot be replicated", > > - RelationGetRelationName(targetrel)), > > - errdetail("Temporary and unlogged relations cannot be replicated."))); > > + { > > + if (RelationUsesLocalBuffers(targetrel)) > > + ereport(ERROR, > > + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), > > + errmsg("\"%s\" is a temporary table", > > + RelationGetRelationName(targetrel)), > > + errdetail("Temporary tables cannot be added to publications."))); > > + else if (targetrel->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED) > > + ereport(ERROR, > > + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), > > + errmsg("\"%s\" is an unlogged table", > > + RelationGetRelationName(targetrel)), > > + errdetail("Unlogged tables cannot be added to publications."))); > > + } > > > > RelationIsPermanent(), RelationUsesLocalBuffers(), and > > targetrel->rd_rel->relpersistence all refers to relpersistence. Hence, it is > > not necessary to test !RelationIsPermanent(). > > Done. > > > I would slightly rewrite the commit message to something like: > > > > Improve publication error messages > > > > Adding a foreign table into a publication prints an error saying "foo is > > not a > > table". Although, a foreign table is not a regular table, this message could > > possibly confuse users. Provide a suitable error message according to the > > object class (table vs foreign table). While at it, separate unlogged/temp > > table error message into 2 messages. > > Thanks for the better wording. > > Attaching v5 patch, please have a look. > We get the following error while adding an index: create publication mypub for table idx_t1; ERROR: "idx_t1" is an index This error occurs internally from table_openrv function call, if we could replace this with relation_openrv and then check the table kind, we could throw a similar error message here too like the other changes in the patch. Regards, Vignesh
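The suggestion seems to amount to something like the following sketch (illustration only, not patch code; the lock level and the errdetail wording are assumptions here):

	rel = relation_openrv(relation, ShareUpdateExclusiveLock);
	if (rel->rd_rel->relkind == RELKIND_INDEX)
		ereport(ERROR,
				(errcode(ERRCODE_WRONG_OBJECT_TYPE),
				 errmsg("\"%s\" is an index", RelationGetRelationName(rel)),
				 errdetail("Indexes cannot be added to publications.")));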
Re: CREATE COLLATION - check for duplicate options and error out if found one
On Wed, May 26, 2021 at 7:18 PM vignesh C wrote: > +1 for fixing this issue, we have handled this error in other places. > The patch does not apply on head, could you rebase the patch on head > and post a new patch. Thanks. I thought of rebasing once the other patch (which reorganizes "...specified more than once" error) gets committed. Anyways, I've rebased for now on the latest master. Please review v2 patch. With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com v2-0001-Check-duplicate-options-and-error-out-for-CREATE-.patch Description: Binary data
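For readers who do not open the attachment: based on the error message and parser_errposition() mentioned upthread, the added check presumably follows the usual duplicate-option pattern in DefineCollation()'s DefElem loop, roughly like this (variable names assumed):

	if (localeEl != NULL)		/* this option was already seen? */
		ereport(ERROR,
				(errcode(ERRCODE_SYNTAX_ERROR),
				 errmsg("option \"%s\" specified more than once", defel->defname),
				 parser_errposition(pstate, defel->location)));
	localeEl = defel;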
Re: Logical Replication - improve error message while adding tables to the publication in check_publication_add_relation
On Wed, May 26, 2021 at 7:38 PM vignesh C wrote: > > Attaching v5 patch, please have a look. > > We get the following error while adding an index: > create publication mypub for table idx_t1; > ERROR: "idx_t1" is an index > > This error occurs internally from table_openrv function call, if we > could replace this with relation_openrv and then check the table kind, > we could throw a similar error message here too like the other changes > in the patch. Do you say that we replace table_open in publication_add_relation with relation_open and have the "\"%s\" is an index" or "\"%s\" is a composite type" checks in check_publication_add_relation? If that is so, I don't think it's a good idea to have the extra code in check_publication_add_relation and I would like it to be the way it is currently. With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
Re: pg_rewind fails if there is a read only file.
On 5/26/21 2:43 AM, Laurenz Albe wrote: > On Wed, 2021-05-26 at 08:57 +0900, Michael Paquier wrote: >> On Tue, May 25, 2021 at 03:17:37PM -0400, Andrew Dunstan wrote: >>> If we do decide to do something the question arises what should it do? >>> If we're to allow for it I'm wondering if the best thing would be simply >>> to ignore such a file. >> Enforcing assumptions that any file could be ready-only is a very bad >> idea, as that could lead to weird behaviors if a FS is turned as >> becoming read-only suddenly while doing a rewind. Another idea that >> has popped out across the years was to add an option to pg_rewind so >> as users could filter files manually. That could be easily dangerous >> though in the wrong hands, as one could think that it is a good idea >> to skip a control file, for example. >> >> The thing is that here we actually know the set of files we'd like to >> ignore most of the time, and we still want to have some automated >> control what gets filtered. So here is a new idea: we build a list of >> files based on a set of GUC parameters using postgres -C on the target >> data folder, and assume that these are safe enough to be skipped all >> the time, if these are in the data folder. > That sounds complicated, but should work. > There should be a code comment somewhere that warns people not to forget > to look at that when they add a new GUC. > > I can think of two alternatives to handle this: > > - Skip files that cannot be opened for writing and issue a warning. > That is simple, but coarse. > A slightly more sophisticated version would first check if files > are the same on both machines and skip the warning for those. > > - Paul's idea to try and change the mode on the read-only file > and reset it to the original state after pg_rewind is done. > How about we only skip (with a warning) read-only files if they are in the data root, not a subdirectory? That would save us from silently ignoring read-only filesystems and not involve using a GUC. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Re: Next Commitfest Manager.
On Thu, Feb 4, 2021 at 1:13 AM Stephen Frost wrote: > Greetings, > > * Ibrar Ahmed (ibrar.ah...@gmail.com) wrote: > > Anyone else already volunteers that? It is my first time so need some > > access, if all agree. > > Thanks for volunteering! > > That said, our last commitfest tends to be the most difficult as it's > the last opportunity for features to land in time for the next major > release and, for my part at least, I think it'd be best to have > someone who has experience running a CF previously manage it. > > To that end, I've talked to David Steele, who has run this last CF for > the past few years and we're in agreement that he's willing to run this > CF again this year, assuming there's no objections. What we've thought > to suggest is that you follow along with David as he runs this CF and > then offer to run the July CF. Of course, we would encourage you and > David to communicate and for you to ask David any questions you have > about how he handles things as part of the CF. This is in line with how > other CF managers have started out also. > As we know, the July commitfest is coming, and I already volunteered to manage it. I did some small work with David during the last commitfest and am now ready to take it on. > > Open to your thoughts, as well as those of anyone else who wishes to > comment. > > Thanks! > > Stephen > -- Ibrar Ahmed

Re: Replacing pg_depend PIN entries with a fixed range check
On Wed, May 12, 2021 at 6:21 PM Tom Lane wrote: > Here's a v2 that does things that way (and is rebased up to HEAD). > I did some more documentation cleanup, too. The first hunk of the patch seems to back away from the idea that the cutoff is 13000, but the second half of the patch says 13000 still matters. Not sure I understand what's going on there exactly. I suggest deleting the words "An additional thing that is useful to know is that" because the rest of the sentence is fine without it. I'm sort of wondering what we think the long term plan ought to be. Are there some categories of things we should be looking to move out of the reserved OID space to keep it from filling up? Can we realistically think of moving the 16384 boundary? -- Robert Haas EDB: http://www.enterprisedb.com
Re: Add ZSON extension to /contrib/
On 25.05.2021 13:55, Aleksander Alekseev wrote: Hi hackers, Back in 2016 while being at PostgresPro I developed the ZSON extension [1]. The extension introduces the new ZSON type, which is 100% compatible with JSONB but uses a shared dictionary of strings most frequently used in given JSONB documents for compression. These strings are replaced with integer IDs. Afterward, PGLZ (and now LZ4) applies if the document is large enough by common PostgreSQL logic. Under certain conditions (many large documents), this saves disk space, memory and increases the overall performance. More details can be found in README on GitHub. The extension was accepted warmly and instantaneously I got several requests to submit it to /contrib/ so people using Amazon RDS and similar services could enjoy it too. Back then I was not sure if the extension is mature enough and if it lacks any additional features required to solve the real-world problems of the users. Time showed, however, that people are happy with the extension as it is. There were several minor issues discovered, but they were fixed back in 2017. The extension never experienced any compatibility problems with the next major release of PostgreSQL. So my question is if the community may consider adding ZSON to /contrib/. If this is the case I will add this thread to the nearest CF and submit a corresponding patch. [1]: https://github.com/postgrespro/zson -- Best regards, Aleksander Alekseev Open-Source PostgreSQL Contributor at Timescale Yet another approach to the same problem: https://github.com/postgrespro/jsonb_schema Instead of compression JSONs we can try to automatically detect JSON schema (names and types of JSON fields) and store it separately from values. This approach is more similar with one used in schema-less databases. It is most efficient if there are many JSON records with the same schema and sizes of keys are comparable with size of values. At IMDB data set it cause reducing of database size about 1.7 times.
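As a toy illustration of the shared-dictionary idea both approaches rely on — frequently occurring JSONB key strings replaced by 16-bit integer codes looked up in a sorted dictionary — here is a self-contained C sketch. The dictionary contents and function names are invented for illustration; ZSON's real on-disk format, dictionary learning step, and fallback handling are considerably more involved.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Toy dictionary: kept sorted so it can be binary-searched. */
static const char *dict[] = {"city", "country", "email", "name", "phone"};
#define DICT_SIZE (sizeof(dict) / sizeof(dict[0]))

/* Return the 16-bit code for a key, or UINT16_MAX if it is not in the dictionary. */
static uint16_t
dict_lookup(const char *key)
{
    size_t lo = 0, hi = DICT_SIZE;

    while (lo < hi)
    {
        size_t mid = (lo + hi) / 2;
        int    cmp = strcmp(key, dict[mid]);

        if (cmp == 0)
            return (uint16_t) mid;
        if (cmp < 0)
            hi = mid;
        else
            lo = mid + 1;
    }
    return UINT16_MAX;          /* unknown string: store it verbatim */
}

int
main(void)
{
    printf("\"email\" -> %u\n", (unsigned) dict_lookup("email"));   /* 2 */
    printf("\"zip\"   -> %u\n", (unsigned) dict_lookup("zip"));     /* 65535 */
    return 0;
}

The 2**16 limit mentioned in the thread is simply the range of that 16-bit code.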
Re: Move pg_attribute.attcompression to earlier in struct for reduced size?
Michael Paquier writes: > Ah, the parallel with reltablespace and default_tablespace at database > level is a very good point. It is true that currently the code would > assign attcompression to a non-zero value once the relation is defined > depending on default_toast_compression set for the database, but > setting it to 0 in this case would be really helpful to change the > compression methods of all the relations if doing something as crazy > as a VACUUM FULL for this database. Count me as convinced. Here's a draft patch series to address this. 0001 removes the relkind checks I was questioning originally. As expected, this results in zero changes in check-world results. 0002 is the main change in the semantics of attcompression. This does change the results of compression.sql, but in what seem to me to be expected ways: a column's compression option is now shown in \d+ output only if you explicitly set it. 0003 further removes pg_dump's special handling of default_toast_compression. I don't think we need that anymore. AFAICS its only effect would be to override the receiving server's default_toast_compression setting for dumped/re-loaded data, which does not seem like a behavior that anyone would want. Loose ends: * I've not reviewed the docs fully; there are likely some more things that need updated. * As things stand here, once you've applied ALTER ... SET COMPRESSION to select a specific method, there is no way to undo that and go back to the use-the-default setting. All you can do is change to explicitly select the other method. Should we invent "ALTER ... SET COMPRESSION default" or the like to cover that? (Since DEFAULT is a reserved word, that exact syntax might be a bit of a pain to implement, but maybe we could think of another word.) * I find GetDefaultToastCompression() annoying. I do not think it is project style to invent trivial wrapper functions around GUC variable references: it buys nothing while requiring readers to remember one more name than they would otherwise. Since there are only two uses remaining, maybe this isn't very important either way, but I'm still inclined to flush it. Comments? regards, tom lane diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c index 11e91c4ad3..69b1839fd7 100644 --- a/src/backend/commands/tablecmds.c +++ b/src/backend/commands/tablecmds.c @@ -897,17 +897,8 @@ DefineRelation(CreateStmt *stmt, char relkind, Oid ownerId, if (colDef->generated) attr->attgenerated = colDef->generated; - /* - * lookup attribute's compression method and store it in the - * attr->attcompression. 
- */ - if (relkind == RELKIND_RELATION || - relkind == RELKIND_PARTITIONED_TABLE || - relkind == RELKIND_MATVIEW) - attr->attcompression = -GetAttributeCompression(attr, colDef->compression); - else - attr->attcompression = InvalidCompressionMethod; + attr->attcompression = GetAttributeCompression(attr, + colDef->compression); } /* @@ -6602,13 +6593,8 @@ ATExecAddColumn(List **wqueue, AlteredTableInfo *tab, Relation rel, attribute.attbyval = tform->typbyval; attribute.attalign = tform->typalign; attribute.attstorage = tform->typstorage; - /* do not set compression in views etc */ - if (rel->rd_rel->relkind == RELKIND_RELATION || - rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE) - attribute.attcompression = GetAttributeCompression(&attribute, - colDef->compression); - else - attribute.attcompression = InvalidCompressionMethod; + attribute.attcompression = GetAttributeCompression(&attribute, + colDef->compression); attribute.attnotnull = colDef->is_not_null; attribute.atthasdef = false; attribute.atthasmissing = false; @@ -12299,23 +12285,10 @@ ATExecAlterColumnType(AlteredTableInfo *tab, Relation rel, attTup->attbyval = tform->typbyval; attTup->attalign = tform->typalign; attTup->attstorage = tform->typstorage; - - /* Setup attribute compression */ - if (rel->rd_rel->relkind == RELKIND_RELATION || - rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE) - { - /* - * No compression for plain/external storage, otherwise, default - * compression method if it is not already set, refer comments atop - * attcompression parameter in pg_attribute.h. - */ - if (!IsStorageCompressible(tform->typstorage)) - attTup->attcompression = InvalidCompressionMethod; - else if (!CompressionMethodIsValid(attTup->attcompression)) - attTup->attcompression = GetDefaultToastCompression(); - } - else + if (!IsStorageCompressible(tform->typstorage)) attTup->attcompression = InvalidCompressionMethod; + else if (!CompressionMethodIsValid(attTup->attcompression)) + attTup->attcompression = GetDefaultToastCompression(); ReleaseSysCache(typeTuple); diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml index b26b872a06..16493209c6 100644 --- a/doc/src/sgml/catalogs.sgml +++ b/doc/src/sgml/catalogs.sgml @@
Re: Move pg_attribute.attcompression to earlier in struct for reduced size?
On Wed, May 26, 2021 at 11:13 AM Tom Lane wrote: > * As things stand here, once you've applied ALTER ... SET COMPRESSION > to select a specific method, there is no way to undo that and go > back to the use-the-default setting. All you can do is change to > explicitly select the other method. Should we invent "ALTER ... > SET COMPRESSION default" or the like to cover that? (Since > DEFAULT is a reserved word, that exact syntax might be a bit of > a pain to implement, but maybe we could think of another word.) Yes. Irreversible catalog changes are bad. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Replacing pg_depend PIN entries with a fixed range check
Robert Haas writes: > The first hunk of the patch seems to back away from the idea that the > cutoff is 13000, but the second half of the patch says 13000 still > matters. Not sure I understand what's going on there exactly. Not sure exactly what you're looking at, but IIRC there is a place where the patch is cleaning up after ab596105b's failure to adjust bki.sgml to match its change of FirstBootstrapObjectId from 12000 to 13000. I hadn't bothered to fix that separately, but I guess we should do so, else v14 is going to ship with incorrect docs. > I'm sort of wondering what we think the long term plan ought to be. > Are there some categories of things we should be looking to move out > of the reserved OID space to keep it from filling up? Can we > realistically think of moving the 16384 boundary? I haven't got any wonderful ideas there. I do not see how we can move the 16384 boundary without breaking pg_upgrade'ing, because pg_upgrade relies on preserving user object OIDs that are likely to be not much above that value. Probably, upping FirstNormalObjectId ought to be high on our list of things to do if we ever do force an on-disk compatibility break. In the meantime, we could decrease the 10000 boundary if things get tight above that, but I fear that would annoy some extension maintainers. Another idea is to give up the principle that initdb-time OIDs need to be globally unique. They only really need to be unique within their own catalogs, so we could buy a lot of space by exploiting that. The original reason for that policy was to reduce the risk of mistakes in handwritten OID references in the initial catalog data --- but now that numeric references there are Not Done, it seems like we don't really need that. An intermediate step, perhaps, could be to give up that uniqueness only for OIDs assigned by genbki.pl itself, while keeping it for OIDs below 10000. This'd be appealing if we find that we're getting tight between 10K and 13K. In any case it doesn't seem like the issue is entirely pressing yet. Although ... maybe we should do that last bit now, so that we can revert FirstBootstrapObjectId to 12K before v14 ships? I've felt a little bit of worry that that change might cause problems on machines with a boatload of locales. regards, tom lane
Re: Race condition in recovery?
On Wed, May 26, 2021 at 2:44 AM Dilip Kumar wrote: > I think we need to create some content on promoted standby and check > whether the cascade standby is able to get that or not, that will > guarantee that it is actually following the promoted standby, I have > added the test for that so that it matches the comments. OK, but I ran this test against an unpatched server and it passed. That's not so great, because the test should fail without the bug fix. It seems to be because there's only one actual test in this test case. Looking at the log file, src/test/recovery/tmp_check/log/regress_log_025_timeline_issue, the only "ok" nor "not ok" line is: ok 1 - check streamed content on cascade standby So either that test is not right or some other test is needed. I think there's something else going wrong here, because when I run my original test.sh script, I see this: 2021-05-26 11:37:47.794 EDT [57961] LOG: restored log file "0002.history" from archive ... 2021-05-26 11:37:47.916 EDT [57961] LOG: redo starts at 0/228 ... 2021-05-26 11:37:47.927 EDT [57966] LOG: replication terminated by primary server 2021-05-26 11:37:47.927 EDT [57966] DETAIL: End of WAL reached on timeline 1 at 0/300 But in the src/test/recovery/tmp_check/log/025_timeline_issue_cascade.log file generated by this test case: cp: /Users/rhaas/pgsql/src/test/recovery/tmp_check/t_025_timeline_issue_primary_data/archives/0002.history: No such file or directory ... 2021-05-26 11:41:08.149 EDT [63347] LOG: fetching timeline history file for timeline 2 from primary server ... 2021-05-26 11:41:08.288 EDT [63344] LOG: new target timeline is 2 ... 2021-05-26 11:41:08.303 EDT [63344] LOG: redo starts at 0/228 ... 2021-05-26 11:41:08.331 EDT [63347] LOG: restarted WAL streaming at 0/300 on timeline 2 So it doesn't seem like the test is actually reproducing the problem correctly. The timeline history file isn't available from the archive, so it streams it, and then the problem doesn't occur. I guess that's because there's nothing to guarantee that the history file reaches the archive before 'cascade' is started. The code just does: # Promote the standby. $node_standby->psql('postgres', 'SELECT pg_promote()'); # Start cascade node $node_cascade->start; ...which has a clear race condition. src/test/recovery/t/023_pitr_prepared_xact.pl has logic to wait for a WAL file to be archived, so maybe we can steal that logic and use it here. I suggest we rename the test to something a bit more descriptive. Like instead of 025_timeline_issue.pl, perhaps 025_stuck_on_old_timeline.pl? Or I'm open to other suggestions, but "timeline issue" is a bit too vague for my taste. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Race condition in recovery?
On Wed, May 26, 2021 at 9:40 PM Robert Haas wrote: > > On Wed, May 26, 2021 at 2:44 AM Dilip Kumar wrote: > > I think we need to create some content on promoted standby and check > > whether the cascade standby is able to get that or not, that will > > guarantee that it is actually following the promoted standby, I have > > added the test for that so that it matches the comments. > > OK, but I ran this test against an unpatched server and it passed. > That's not so great, because the test should fail without the bug fix. > It seems to be because there's only one actual test in this test case. > Looking at the log file, > src/test/recovery/tmp_check/log/regress_log_025_timeline_issue, the > only "ok" nor "not ok" line is: > > ok 1 - check streamed content on cascade standby > > So either that test is not right or some other test is needed. I think > there's something else going wrong here, because when I run my > original test.sh script, I see this: Thats strange, when I ran the test I can see below in log of cascade node (which shows that cascade get the history file but not the WAL file and then it select the old timeline and never go to the new timeline) ... 2021-05-26 21:46:54.412 IST [84080] LOG: restored log file "0002.history" from archive cp: cannot stat ‘/home/dilipkumar/work/PG/postgresql/src/test/recovery/tmp_check/t_025_timeline_issue_primary_data/archives/0003.history’: No such file or directory 2021-05-26 21:46:54.415 IST [84080] LOG: entering standby mode 2021-05-26 21:46:54.419 IST [84080] LOG: restored log file "0002.history" from archive . 2021-05-26 21:46:54.429 IST [84085] LOG: started streaming WAL from primary at 0/200 on timeline 1 -> stream using previous TL 2021-05-26 21:46:54.466 IST [84080] LOG: redo starts at 0/228 2021-05-26 21:46:54.466 IST [84080] LOG: consistent recovery state reached at 0/300 2021-05-26 21:46:54.467 IST [84079] LOG: database system is ready to accept read only connections 2021-05-26 21:46:54.483 IST [84085] LOG: replication terminated by primary server 2021-05-26 21:46:54.483 IST [84085] DETAIL: End of WAL reached on timeline 1 at 0/300. cp: cannot stat ‘/home/dilipkumar/work/PG/postgresql/src/test/recovery/tmp_check/t_025_timeline_issue_primary_data/archives/0003.history’: No such file or directory cp: cannot stat ‘/home/dilipkumar/work/PG/postgresql/src/test/recovery/tmp_check/t_025_timeline_issue_primary_data/archives/00010003’: No such file or directory 2021-05-26 21:46:54.498 IST [84085] LOG: primary server contains no more WAL on requested timeline 1 And finally the test case fails because the cascade can never get the changes. I will check if there is any timing dependency in the test case. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: Replacing pg_depend PIN entries with a fixed range check
On Wed, May 26, 2021 at 11:37 AM Tom Lane wrote: > In any case it doesn't seem like the issue is entirely pressing > yet. Although ... maybe we should do that last bit now, so > that we can revert FirstBootstrapObjectId to 12K before v14 > ships? I've felt a little bit of worry that that change might > cause problems on machines with a boatload of locales. I think that particular case is definitely worth worrying about. Most of what we put into the system catalogs is our own hand-crafted entries, but that's coming from the operating system and we have no control over it whatever. It wouldn't be very nice to have to suggest to users who find they can't initdb that perhaps they should delete some locales... Honestly, it seems odd to me that these entries use reserved OIDs rather than regular ones at all. Why does the first run of pg_import_system_collations use special magic OIDs, and later runs use regular OIDs? pg_type OIDs need to remain stable from release to release since they're part of the on-disk format for arrays, and pg_proc OIDs have to be the same at compile time and initdb time because of the fmgr hash table, and any other thing that has a constant that might be used in the source code also has that issue. But none of this applies to collations: they can't be expected to have the same OID from release to release, or even from one installation to another; the source code can't be relying on the specific values; and we have no idea how many there might be. So I think your proposal of allowing genbki-assigned OIDs to be reused in different catalogs is probably a pretty good one, but I wonder if we could just rejigger things so that collations just get normal OIDs > 16384. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Race condition in recovery?
On Wed, May 26, 2021 at 12:26 PM Dilip Kumar wrote: > I will check if there is any timing dependency in the test case. There is. I explained it in the second part of my email, which you may have failed to notice. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Race condition in recovery?
On Wed, 26 May 2021 at 10:06 PM, Robert Haas wrote: > On Wed, May 26, 2021 at 12:26 PM Dilip Kumar > wrote: > > I will check if there is any timing dependency in the test case. > > There is. I explained it in the second part of my email, which you may > have failed to notice. Sorry, my bad. I got your point now. I will change the test. > -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: CALL versus procedures with output-only arguments
On 25.05.21 22:21, Tom Lane wrote: Yeah, the odd behavior of CALL is where I started from, but now I think the main problem is with the signature (ie, allowing procedures with signatures that differ only in OUT parameter positions). If we got rid of that choice then it'd be possible to document that you should only ever write NULL for OUT-parameter positions, because the type of such an argument would never be significant for disambiguation. AFAICT, your patch does not maintain the property that CREATE PROCEDURE p1(OUT int, OUT int) corresponds to DROP PROCEDURE p1(int, int) which would be bad. I'm not opposed to reverting the feature if we can't find a good solution in a hurry. The main value of this feature is for migrations, so I want to be sure that whatever we settle on doesn't back us into a corner with respect to that. We could perhaps also just disable the SQL-level calling until a better solution arises. AFAICT, things work okay in PL/pgSQL, because OUT parameters are tied to a typed target there.
Re: Add ZSON extension to /contrib/
On Wed, 26 May 2021 at 12:49, Aleksander Alekseev wrote: > > Hi hackers, > > Many thanks for your feedback, I very much appreciate it! > > > If the extension is mature enough, why make it an extension in > > contrib, and not instead either enhance the existing jsonb type with > > it or make it a built-in type? > > > IMO we have too d*mn many JSON types already. If we can find a way > > to shoehorn this optimization into JSONB, that'd be great. Otherwise > > I do not think it's worth the added user confusion. > > Magnus, Tom, > > My reasoning is that if the problem can be solved with an extension > there is little reason to modify the core. This seems to be in the > spirit of PostgreSQL. If the community reaches the consensus to modify > the core to introduce a similar feature, we could discuss this as > well. It sounds like a lot of unnecessary work to me though (see > below). > > > * doesn't cover all cases, notably indexes. > > Tom, > > Not sure if I follow. What cases do you have in mind? > > > Do note that e.g. postgis is not in contrib, but is available in e.g. RDS. > > Matthias, > > Good point. I suspect that PostGIS is an exception though... Quite a few other non-/common/ extensions are available in RDS[0], some of which are HLL (from citusdata), pglogical (from 2ndQuadrant) and orafce (from Pavel Stehule, orafce et al.). > > I like the idea of the ZSON type, but I'm somewhat disappointed by its > > current limitations > > Several people suggested various enhancements right after learning > about ZSON. Time showed, however, that none of the real-world users > really need e.g. more than one common dictionary per database. I > suspect this is because no one has more than 2**16 repeatable unique > strings (one dictionary limitation) in their documents. Thus there is > no benefit in having separate dictionaries and corresponding extra > complexity. IMO the main benefit of having different dictionaries is that you could have a small dictionary for small and very structured JSONB fields (e.g. some time-series data), and a large one for large / unstructured JSONB fields, without having the significant performance impact of having that large and varied dictionary on the small&structured field. Although a binary search is log(n) and thus still quite cheap even for large dictionaries, the extra size is certainly not free, and you'll be touching more memory in the process. > > - Each dictionary uses a lot of memory, regardless of the number of > > actual stored keys. For 32-bit systems the base usage of a dictionary > > without entries ((sizeof(Word) + sizeof(uint16)) * 2**16) would be > > almost 1MB, and for 64-bit it would be 1.7MB. That is significantly > > more than I'd want to install. > > You are probably right on this one, this part could be optimized. I > will address this if we agree on submitting the patch. > > > - You call gettimeofday() in both dict_get and in get_current_dict_id. > > These functions can be called in short and tight loops (for small GSON > > fields), in which case it would add significant overhead through the > > implied syscalls. > > I must admit, I'm not an expert in this area. My understanding is that > gettimeofday() is implemented as single virtual memory access on > modern operating systems, e.g. VDSO on Linux, thus it's very cheap. > I'm not that sure about other supported platforms though. Probably > worth investigating. Yes, but vDSO does not necessarily work on all systems: e.g. 
in 2017, a lot on EC2 [1] was run using Xen with vDSO not working for gettimeofday. I'm uncertain if this issue persists for their new KVM/Nitro hypervisor. > > It does mean that you're deTOASTing > > the full GSON field, and that the stored bytestring will not be > > structured / doesn't work well with current debuggers. > > Unfortunately, I'm not very well aware of debugging tools in this > context. Could you please name the debuggers I should take into > account? Hmm, I was mistaken in that regard. I was under the impression that at least one of pageinspect, pg_filedump and pg_hexedit did support column value introspection, which they apparently do not. pg_filedump (and thus pg_hexedit) have some introspection, but none specialized for jsonb (yet). The point I tried to make was that introspection of GSON would be even more difficult due to it adding a non-standard compression method which makes introspection effectively impossible (the algorithm can replace things other than the strings it should replace, so it will be difficult to retrieve structure from the encoded string). With regards, Matthias van de Meent [0] https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/CHAP_PostgreSQL.html#PostgreSQL.Concepts.General.FeatureSupport.Extensions.13x [1] https://blog.packagecloud.io/eng/2017/03/08/system-calls-are-much-slower-on-ec2/
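Since the vDSO question is easy to settle empirically, here is a minimal C micro-benchmark one could run on a target system (for example an EC2 instance) to see what a gettimeofday() call actually costs there; with a working vDSO it is typically tens of nanoseconds, while a genuine syscall path is much slower. This is only a measurement sketch, not part of ZSON.

#include <stdio.h>
#include <sys/time.h>

int
main(void)
{
    struct timeval start, end, tv;
    const long iters = 10 * 1000 * 1000;

    gettimeofday(&start, NULL);
    for (long i = 0; i < iters; i++)
        gettimeofday(&tv, NULL);        /* the call being measured */
    gettimeofday(&end, NULL);

    double secs = (end.tv_sec - start.tv_sec) + (end.tv_usec - start.tv_usec) / 1e6;

    printf("%.1f ns per gettimeofday() call\n", secs * 1e9 / iters);
    return 0;
}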
Re: Replacing pg_depend PIN entries with a fixed range check
Robert Haas writes: > So I think your proposal of allowing genbki-assigned OIDs to be reused > in different catalogs is probably a pretty good one, but I wonder if > we could just rejigger things so that collations just get normal OIDs > > 16384. Hm. I can't readily think of a non-hack way of making that happen. It's also unclear to me how it'd interact with assignment of OIDs to regular user objects. Maybe we'll have to go there eventually, but I'm not in a hurry to. Meanwhile, I'll draft a patch for the other thing. regards, tom lane
Re: CALL versus procedures with output-only arguments
Peter Eisentraut writes: > AFAICT, your patch does not maintain the property that > CREATE PROCEDURE p1(OUT int, OUT int) > corresponds to > DROP PROCEDURE p1(int, int) > which would be bad. Why? If it actually works that way right now, I'd maintain strenuously that it's broken. The latter should be referring to a procedure with two IN arguments. Even if the SQL spec allows fuzziness about that, we cannot afford to, because we have a more generous view of overloading than the spec does. (As far as I could tell from looking at the spec yesterday, they think that you aren't allowed to have two procedures with the same name/schema and same number of arguments, regardless of the details of those arguments. Up with that I will not put.) > I'm not opposed to reverting the feature if we can't find a good > solution in a hurry. I'm not looking to revert the feature. I mainly want a saner catalog representation, and less inconsistency in object naming (which is tightly tied to the first thing). regards, tom lane
Re: Skip partition tuple routing with constant partition key
> > Hi, Amit: > For ConvertTupleToPartition() in 0001-ExecFindPartition-cache-last-used-partition-v3.patch: + if (tempslot != NULL) + ExecClearTuple(tempslot); If tempslot and parent_slot point to the same slot, should ExecClearTuple() still be called ? Cheers
Re: storing an explicit nonce
On Tue, May 25, 2021 at 7:58 PM Stephen Frost wrote: > The simple thought I had was masking them out, yes. No, you can't > re-encrypt a different page with the same nonce. (Re-encrypting the > exact same page with the same nonce, however, just yields the same > cryptotext and therefore is fine). In the interest of not being viewed as too much of a naysayer, let me first reiterate that I am generally in favor of TDE going forward and am not looking to throw up unnecessary obstacles in the way of making that happen. That said, I don't see how this particular idea can work. When we want to write a page out to disk, we need to identify which bits in the page are hint bits, so that we can avoid including them in what is encrypted, which seems complicated and expensive. But even worse, when we then read a page back off of disk, we'd need to decrypt everything except for the hint bits, but how do we know which bits are hint bits if the page isn't decrypted yet? We can't annotate an 8kB page that might be full with enough extra information to say where the non-encrypted parts are and still have the result be guaranteed to fit within 8kb. Also, it's not just hint bits per se, but anything that would cause us to use MarkBufferDirtyHint(). For a btree index, per _bt_check_unique and _bt_killitems, that includes the entire line pointer array, because of how ItemIdMarkDead() is used. Even apart from the problem of how decryption would know which things we encrypted and which things we didn't, I really have a hard time believing that it's OK to exclude the entire line pointer array in every btree page from encryption from a security perspective. Among other potential problems, that's leaking all the information an attacker could possibly want to have about where their known plaintext might occur in the page. However, I believe that if we store the nonce in the page explicitly, as proposed here, rather trying to derive it from the LSN, then we don't need to worry about this kind of masking, which I think is better from both a security perspective and a performance perspective. There is one thing I'm not quite sure about, though. I had previously imagined that each page would have a nonce and we could just do nonce++ each time we write the page. But that doesn't quite work if the standby can do more writes of the same page than the master. One vague idea I have for fixing this is: let each page's 16-byte nonce consist of 8 random bytes and an 8-byte counter that will be incremented on every write. But, the first time a standby writes each page, force a "key rotation" where the 8-byte random value is replaced with a new one, different one from what the master is using for that page. Detecting this is a bit expensive, because it probably means we need to store the TLI that last wrote each page on every page too, but maybe it could be made to work; we're talking about a feature that is expensive by nature. However, I'm a little worried about the cryptographic properties of this approach. It would often mean that an attacker who has full filesystem access can get multiple encrypted images of the same data, each encrypted with a different nonce. I don't know whether that's a hazard or not, but it feels like the sort of thing that, if I were a cryptographer, I would be pleased to have. Another idea might be - instead of doing nonce++ every time we write the page, do nonce=random(). That's eventually going to repeat a value, but it's extremely likely to take a *super* long time if there are enough bits. 
A potentially rather large problem, though, is that generating random numbers in large quantities isn't very cheap. Anybody got a better idea? I really like your (Stephen's) idea of including something in the special space that permits integrity checking. One thing that is quite nice about that is we could do it first, as an independent patch, before we did TDE. It would be an independently useful feature, and it would mean that if there are any problems with the code that injects stuff into the special space, we could try to track those down in a non-TDE context. That's really good, because in a TDE context, the pages are going to be garbled and unreadable (we hope, anyway). If we have a problem that we can reproduce with just an integrity-checking token shoved into every page, you can look at the page and try to understand what went wrong. So I really like this direction both from the point of view of improving integrity checking, and also from the point of view of being able to debug problems. Now, one downside of this approach is that if we have the ability to turn integrity-checking tokens on and off, and separately we can turn encryption on and off, then we can't simplify down to two cases as Andres was advocating above; you have to cater to a variety of possible values of how-much-stuff-we-squeezed-into-the-special space. At that point you kind of end up with the approach the draft patches were already taking,
Re: storing an explicit nonce
Greetings, * Robert Haas (robertmh...@gmail.com) wrote: > On Tue, May 25, 2021 at 7:58 PM Stephen Frost wrote: > > The simple thought I had was masking them out, yes. No, you can't > > re-encrypt a different page with the same nonce. (Re-encrypting the > > exact same page with the same nonce, however, just yields the same > > cryptotext and therefore is fine). > > In the interest of not being viewed as too much of a naysayer, let me > first reiterate that I am generally in favor of TDE going forward and > am not looking to throw up unnecessary obstacles in the way of making > that happen. Quite glad to hear that. Hopefully we'll all be able to get on the same page to move TDE forward. > That said, I don't see how this particular idea can work. When we want > to write a page out to disk, we need to identify which bits in the > page are hint bits, so that we can avoid including them in what is > encrypted, which seems complicated and expensive. But even worse, when > we then read a page back off of disk, we'd need to decrypt everything > except for the hint bits, but how do we know which bits are hint bits > if the page isn't decrypted yet? We can't annotate an 8kB page that > might be full with enough extra information to say where the > non-encrypted parts are and still have the result be guaranteed to fit > within 8kb. Yeah, Andres pointed that out and it's certainly an issue with this general idea. > Also, it's not just hint bits per se, but anything that would cause us > to use MarkBufferDirtyHint(). For a btree index, per _bt_check_unique > and _bt_killitems, that includes the entire line pointer array, > because of how ItemIdMarkDead() is used. Even apart from the problem > of how decryption would know which things we encrypted and which > things we didn't, I really have a hard time believing that it's OK to > exclude the entire line pointer array in every btree page from > encryption from a security perspective. Among other potential > problems, that's leaking all the information an attacker could > possibly want to have about where their known plaintext might occur in > the page. Also a good point. > However, I believe that if we store the nonce in the page explicitly, > as proposed here, rather trying to derive it from the LSN, then we > don't need to worry about this kind of masking, which I think is > better from both a security perspective and a performance perspective. > There is one thing I'm not quite sure about, though. I had previously > imagined that each page would have a nonce and we could just do > nonce++ each time we write the page. But that doesn't quite work if > the standby can do more writes of the same page than the master. One > vague idea I have for fixing this is: let each page's 16-byte nonce > consist of 8 random bytes and an 8-byte counter that will be > incremented on every write. But, the first time a standby writes each > page, force a "key rotation" where the 8-byte random value is replaced > with a new one, different one from what the master is using for that > page. Detecting this is a bit expensive, because it probably means we > need to store the TLI that last wrote each page on every page too, but > maybe it could be made to work; we're talking about a feature that is > expensive by nature. However, I'm a little worried about the > cryptographic properties of this approach. It would often mean that an > attacker who has full filesystem access can get multiple encrypted > images of the same data, each encrypted with a different nonce. 
I > don't know whether that's a hazard or not, but it feels like the sort > of thing that, if I were a cryptographer, I would be pleased to have. I do agree that, in general, this is a feature that's expensive to begin with and folks are generally going to be accepting of that. Encrypting the same data with different nonces will produce different results and shouldn't be an issue. The nonces really do need to be unique for a given key though. > Another idea might be - instead of doing nonce++ every time we write > the page, do nonce=random(). That's eventually going to repeat a > value, but it's extremely likely to take a *super* long time if there > are enough bits. A potentially rather large problem, though, is that > generating random numbers in large quantities isn't very cheap. There's specific discussion about how to choose a nonce in NIST publications and using a properly random one that's large enough is one accepted approach, though my recollection was that the preference was to use an incrementing guaranteed-unique nonce and using a random one was more of a "if you can't coordinate using an incrementing one then you can do this". I can try to hunt for the specifics on that though. The issue of getting large amounts of cryptographically random numbers seems very likely to make this not work so well though. > Anybody got a better idea? If we stipulate (and document) that all replicas need their own keys th
Re: Move pg_attribute.attcompression to earlier in struct for reduced size?
Robert Haas writes: > On Wed, May 26, 2021 at 11:13 AM Tom Lane wrote: >> * As things stand here, once you've applied ALTER ... SET COMPRESSION >> to select a specific method, there is no way to undo that and go >> back to the use-the-default setting. All you can do is change to >> explicitly select the other method. Should we invent "ALTER ... >> SET COMPRESSION default" or the like to cover that? > Yes. Irreversible catalog changes are bad. Here's an add-on 0004 that does that, and takes care of assorted silliness in the grammar and docs --- did you know that this patch caused alter table foo alter column bar set ; to be allowed? I think this is about ready to commit now (though I didn't yet nuke GetDefaultToastCompression). regards, tom lane diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index 95a302ffee..9f7f42c4aa 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -8242,11 +8242,11 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv; compression method for values of compressible columns. (This can be overridden for individual columns by setting the COMPRESSION column option in -CREATE TABLE or +CREATE TABLE or ALTER TABLE.) -The supported compression methods are pglz and, -if PostgreSQL was compiled with ---with-lz4, lz4. +The supported compression methods are pglz and +(if PostgreSQL was compiled with +--with-lz4) lz4. The default is pglz. diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index 3a21129021..08b07f561e 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -26253,10 +26253,10 @@ postgres=# SELECT * FROM pg_walfile_name_offset(pg_stop_backup()); pg_column_compression pg_column_compression ( "any" ) -integer +text -Shows the compression algorithm that was used to compress a +Shows the compression algorithm that was used to compress an individual variable-length value. Returns NULL if the value is not compressed. diff --git a/doc/src/sgml/ref/alter_table.sgml b/doc/src/sgml/ref/alter_table.sgml index 1431d2649b..939d3fe273 100644 --- a/doc/src/sgml/ref/alter_table.sgml +++ b/doc/src/sgml/ref/alter_table.sgml @@ -104,7 +104,6 @@ WITH ( MODULUS numeric_literal, REM GENERATED { ALWAYS | BY DEFAULT } AS IDENTITY [ ( sequence_options ) ] | UNIQUE index_parameters | PRIMARY KEY index_parameters | - COMPRESSION compression_method | REFERENCES reftable [ ( refcolumn ) ] [ MATCH FULL | MATCH PARTIAL | MATCH SIMPLE ] [ ON DELETE referential_action ] [ ON UPDATE referential_action ] } [ DEFERRABLE | NOT DEFERRABLE ] [ INITIALLY DEFERRED | INITIALLY IMMEDIATE ] @@ -391,24 +390,27 @@ WITH ( MODULUS numeric_literal, REM - This sets the compression method to be used for data inserted into a column. - + This form sets the compression method for a column, determining how + values inserted in future will be compressed (if the storage mode + permits compression at all). This does not cause the table to be rewritten, so existing data may still be compressed with other compression methods. If the table is rewritten with VACUUM FULL or CLUSTER, or restored - with pg_restore, then all tuples are rewritten - with the configured compression methods. - - Also, note that when data is inserted from another relation (for example, - by INSERT ... SELECT), tuples from the source data are - not necessarily detoasted, and any previously compressed data is retained - with its existing compression method, rather than recompressing with the - compression methods of the target columns. 
- + with pg_restore, then all values are rewritten + with the configured compression method. + However, when data is inserted from another relation (for example, + by INSERT ... SELECT), values from the source table are + not necessarily detoasted, so any previously compressed data may retain + its existing compression method, rather than being recompressed with the + compression method of the target column. The supported compression methods are pglz and lz4. - lz4 is available only if --with-lz4 - was used when building PostgreSQL. + (lz4 is available only if --with-lz4 + was used when building PostgreSQL.) In + addition, compression_method + can be default, which selects the default behavior of + consulting the setting + at the time of data insertion to determine the method to use. diff --git a/doc/src/sgml/ref/create_table.sgml b/doc/src/sgml/ref/create_table.sgml index a8c5e4028a..c6d0a35e50 100644 --- a/doc/src/sgml/ref/create_table.sgml +++ b/doc/src/sgml/ref/create_table.sgml
Re: Move pg_attribute.attcompression to earlier in struct for reduced size?
On Wed, May 26, 2021 at 11:13:46AM -0400, Tom Lane wrote: > Michael Paquier writes: > > Ah, the parallel with reltablespace and default_tablespace at database > > level is a very good point. It is true that currently the code would > > assign attcompression to a non-zero value once the relation is defined > > depending on default_toast_compression set for the database, but > > setting it to 0 in this case would be really helpful to change the > > compression methods of all the relations if doing something as crazy > > as a VACUUM FULL for this database. Count me as convinced. > > Here's a draft patch series to address this. > > 0001 removes the relkind checks I was questioning originally. > As expected, this results in zero changes in check-world results. > > 0002 is the main change in the semantics of attcompression. > This does change the results of compression.sql, but in what > seem to me to be expected ways: a column's compression option > is now shown in \d+ output only if you explicitly set it. > > 0003 further removes pg_dump's special handling of > default_toast_compression. I don't think we need that anymore. > AFAICS its only effect would be to override the receiving server's > default_toast_compression setting for dumped/re-loaded data, which > does not seem like a behavior that anyone would want. > > Loose ends: > > * I've not reviewed the docs fully; there are likely some more > things that need updated. > > * As things stand here, once you've applied ALTER ... SET COMPRESSION > to select a specific method, there is no way to undo that and go > back to the use-the-default setting. All you can do is change to > explicitly select the other method. Should we invent "ALTER ... > SET COMPRESSION default" or the like to cover that? (Since > DEFAULT is a reserved word, that exact syntax might be a bit of > a pain to implement, but maybe we could think of another word.) > > * I find GetDefaultToastCompression() annoying. I do not think > it is project style to invent trivial wrapper functions around > GUC variable references: it buys nothing while requiring readers > to remember one more name than they would otherwise. Since there > are only two uses remaining, maybe this isn't very important either > way, but I'm still inclined to flush it. +1 It existed when default_toast_compression was a text string. Since e5595de03, it's an enum/int/char, and serves no purpose. -- Justin
Re: storing an explicit nonce
On Wed, May 26, 2021 at 07:14:47AM +0200, Antonin Houska wrote: > Bruce Momjian wrote: > > > On Tue, May 25, 2021 at 04:48:21PM -0700, Andres Freund wrote: > > > Hi, > > > > > > On 2021-05-25 17:29:03 -0400, Bruce Momjian wrote: > > > > So, let me ask --- I thought CTR basically took an encrypted stream of > > > > bits and XOR'ed them with the data. If that is true, then why are > > > > changing hint bits a problem? We already can see some of the bit stream > > > > by knowing some bytes of the page. > > > > > > A *single* reuse of the nonce in CTR reveals nearly all of the > > > plaintext. As you say, the data is XORed with the key stream. Reusing > > > the nonce means that you reuse the key stream. Which in turn allows you > > > to do: > > > (data ^ stream) ^ (data' ^ stream) > > > which can be simplified to > > > (data ^ data') > > > thereby leaking all of data except the difference between data and > > > data'. That's why it's so crucial to ensure that stream *always* differs > > > between two rounds of encrypting "related" data. > > > > > > We can't just "hope" that data doesn't change and use CTR. > > > > My point was about whether we need to change the nonce, and hence > > WAL-log full page images if we change hint bits. If we don't and > > reencrypt the page with the same nonce, don't we only expose the hint > > bits? I was not suggesting we avoid changing the nonce in non-hint-bit > > cases. > > > > I don't understand your computation above. You decrypt the page into > > shared buffers, you change a hint bit, and rewrite the page. You are > > re-XOR'ing the buffer copy with the same key and nonce. Doesn't that > > only change the hint bits in the new write? > > The way I view things is that the CTR mode encrypts each individual bit, > independent from any other bit on the page. For non-hint bits data=data', so > (data ^ data') is always zero, regardless the actual values of the data. So I > agree with you that by reusing the nonce we only expose the hint bits. OK, that's what I thought. We already expose the clog and fsm, so exposing the hint bits seems acceptable. If everyone agrees, I will adjust my patch to not WAL log hint bit changes. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
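To make the quoted keystream-reuse argument concrete, here is a tiny self-contained C demonstration. The "stream" bytes stand in for AES-CTR output under one (key, nonce) pair and are invented for illustration; the point is only that XORing two ciphertexts produced with the same keystream cancels the keystream and exposes exactly the bits that differ between the two plaintexts.

#include <stdio.h>

int
main(void)
{
    /* Stand-in for the AES-CTR keystream for one (key, nonce) pair. */
    const unsigned char stream[8] = {0x3a, 0x91, 0x5c, 0x07, 0xee, 0x42, 0xb8, 0x6d};
    const unsigned char data[9] = "hintbit0";   /* page bytes, first write */
    const unsigned char data2[9] = "hintbit1";  /* same page after a hint-bit change */
    unsigned char ct1[8], ct2[8], leak[8];

    for (int i = 0; i < 8; i++)
    {
        ct1[i] = data[i] ^ stream[i];   /* first write, nonce N */
        ct2[i] = data2[i] ^ stream[i];  /* rewrite with the SAME nonce N */
        leak[i] = ct1[i] ^ ct2[i];      /* what an attacker can compute */
    }

    /* leak[i] == data[i] ^ data2[i]: the keystream has cancelled out. */
    for (int i = 0; i < 8; i++)
        printf("%02x ", leak[i]);
    printf("\n");
    return 0;
}

Every byte that did not change comes out as zero, which matches the reasoning above that re-encrypting a page under the same nonce after a hint-bit-only change exposes, at most, which hint bits flipped.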
Re: storing an explicit nonce
Greetings, * Stephen Frost (sfr...@snowman.net) wrote: > * Robert Haas (robertmh...@gmail.com) wrote: > > Another idea might be - instead of doing nonce++ every time we write > > the page, do nonce=random(). That's eventually going to repeat a > > value, but it's extremely likely to take a *super* long time if there > > are enough bits. A potentially rather large problem, though, is that > > generating random numbers in large quantities isn't very cheap. > > There's specific discussion about how to choose a nonce in NIST > publications and using a properly random one that's large enough is > one accepted approach, though my recollection was that the preference > was to use an incrementing guaranteed-unique nonce and using a random > one was more of a "if you can't coordinate using an incrementing one > then you can do this". I can try to hunt for the specifics on that > though. Discussion of generating IVs here: https://nvlpubs.nist.gov/nistpubs/Legacy/SP/nistspecialpublication800-38d.pdf section 8.2 specifically. Note that 8.3 also discusses subsequent limitations which one should follow when using a random nonce, to reduce the chances of a collision. Thanks, Stephen
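For reference, the "deterministic construction" in section 8.2.1 of that document builds each IV as a fixed field concatenated with an invocation (counter) field, which is essentially the 64-bit-random-value-plus-64-bit-counter layout floated earlier in this thread. Below is a minimal C sketch of that layout; the type and function names are invented, and nothing here says how PostgreSQL would actually persist or coordinate the counter.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Illustration only: a 16-byte per-page nonce built from a fixed field
 * (chosen once per key, e.g. from a CSPRNG at server start) followed by
 * an invocation counter that must never repeat under that key.
 */
typedef struct PageNonce
{
    uint64_t fixed_field;   /* constant for the life of the key */
    uint64_t counter;       /* strictly increasing, one step per page write */
} PageNonce;

static void
nonce_next(PageNonce *state, uint8_t out[16])
{
    state->counter++;       /* a repeat here would be a nonce reuse */
    memcpy(out, &state->fixed_field, sizeof(uint64_t));
    memcpy(out + sizeof(uint64_t), &state->counter, sizeof(uint64_t));
}

int
main(void)
{
    PageNonce state = {0x0123456789abcdefULL, 0};
    uint8_t iv[16];

    nonce_next(&state, iv);
    for (int i = 0; i < 16; i++)
        printf("%02x", iv[i]);
    printf("\n");
    return 0;
}

An AES-GCM deployment would more commonly use a 96-bit IV (for instance a 32-bit fixed field plus a 64-bit counter), but the structure of the construction is the same.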
Re: storing an explicit nonce
On Wed, May 26, 2021 at 2:37 PM Stephen Frost wrote: > > Anybody got a better idea? > > If we stipulate (and document) that all replicas need their own keys > then we no longer need to worry about nonce re-use between the primary > and the replica. Not sure that's *better*, per se, but I do think it's > worth consideration. Teaching pg_basebackup how to decrypt and then > re-encrypt with a different key wouldn't be challenging. I agree that we could do that and that it's possible worth considering. However, it would be easy - and tempting - for users to violate the no-nonce-reuse principle. For example, consider a hypothetical user who takes a backup on Monday via a filesystem snapshot - which might be either (a) a snapshot of the cluster while it is stopped, or (b) a snapshot of the cluster while it's running, from which crash recovery can be safely performed as long as it's a true atomic snapshot, or (c) a snapshot taken between pg_start_backup and pg_stop_backup which will be used just like a backup taken by pg_basebackup. In any of these cases, there's no opportunity for a tool we provide to intervene and re-key. Now, we would provide a tool that re-keys in such situations and tell people to be sure they run it before using any of those backups, and maybe that's the best we can do. However, that tool is going to run for a good long time because it has to rewrite the entire cluster, so someone with a terabyte-scale database is going to be sorely tempted to skip this "unnecessary" and time-consuming step. If it were possible to set things up so that good things happen automatically and without user action, that would be swell. Here's another idea: suppose that a nonce is 128 bits, 64 of which are randomly generated at server startup, and the other 64 of which are a counter. If you're willing to assume that the 64 bits generated randomly at server startup are not going to collide in practice, because the number of server lifetimes per key should be very small compared to 2^64, then this gets you the benefits of a randomly-generate nonce without needing to keep on generating new cryptographically strong random numbers, and pretty much regardless of what users do with their backups. If you replay an FPI, you can write out the page exactly as you got it from the master, without re-encrypting. If you modify and then write a page, you generate a nonce for it containing your own server lifetime identifier. > Yes, if the amount of space available is variable then there's an added > cost for that. While I appreciate the concern about having that be > expensive, for my 2c at least, I like to think that having this sudden > space that's available for use may lead to other really interesting > capabilities beyond the ones we're talking about here, so I'm not really > thrilled with the idea of boiling it down to just two cases. Although I'm glad you like some things about this idea, I think the proposed system will collapse if we press it too hard. We're going to need to be judicious. > One thing to be absolutely clear about here though is that simply taking > a hash() of the ciphertext and storing that with the data does *not* > provide cryptographic data integrity validation for the page because it > doesn't involve the actual key or IV at all and the hash is done after > the ciphertext is generated- therefore an attacker can change the data > and just change the hash to match and you'd never know. Ah, right. 
So you'd actually want something more like hash(dboid||tsoid||relfilenode||blockno||block_contents||secret). Maybe not generated exactly that way: perhaps the secret is really the IV for the hash function rather than part of the hashed data, or whatever. However you do it exactly, it prevents someone from verifying - or faking - a signature unless they have the secret. > very hard for the attacker to discover) and suddently you're doing what > AES-GCM *already* does for you, except you're trying to hack it yourself > instead of using the tools available which were written by experts. I am all in favor of using the expert-written tools provided we can figure out how to do it in a way we all agree is correct. > What this means for your proposal above is that the actual data > validation information will be generated in two different ways depending > on if we're using AES-GCM and doing TDE, or if we're doing just the data > validation piece and not encrypting anything. That's maybe not ideal > but I don't think it's a huge issue either and your proposal will still > address the question of if we end up missing anything when it comes to > how the special area is handled throughout the code. Hmm. Is there no expert-written method for this sort of thing without encryption? One thing that I think would be really helpful is to be able to take a TDE-ified cluster and run it through decryption, ending up with a cluster that still has extra special space but which isn't actually encrypted any more. Ideally
Re: storing an explicit nonce
Greetings, * Bruce Momjian (br...@momjian.us) wrote: > OK, that's what I thought. We already expose the clog and fsm, so > exposing the hint bits seems acceptable. If everyone agrees, I will > adjust my patch to not WAL log hint bit changes. Robert pointed out that it's not just hint bits where this is happening though, but it can also happen with btree line pointer arrays. Even if we were entirely comfortable accepting that the hint bits are leaked because of this, leaking the btree line pointer array doesn't seem like it could possibly be acceptable. I've not run down that code myself, but I don't have any reason to doubt Robert's assessment. Thanks, Stephen
Re: Move pg_attribute.attcompression to earlier in struct for reduced size?
I wrote: > I think this is about ready to commit now (though I didn't yet nuke > GetDefaultToastCompression). Here's a bundled-up final version, in case anybody would prefer to review it that way. regards, tom lane diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml index b26b872a06..16493209c6 100644 --- a/doc/src/sgml/catalogs.sgml +++ b/doc/src/sgml/catalogs.sgml @@ -1261,10 +1261,14 @@ attcompression char - The current compression method of the column. If it is an invalid - compression method ('\0') then column data will not - be compressed. Otherwise, 'p' = pglz compression or - 'l' = LZ4 compression. + The current compression method of the column. Typically this is + '\0' to specify use of the current default setting + (see ). Otherwise, + 'p' selects pglz compression, while + 'l' selects LZ4 + compression. However, this field is ignored + whenever attstorage does not allow + compression. diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index 7e32b0686c..9f7f42c4aa 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -8239,13 +8239,14 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv; This variable sets the default TOAST -compression method for columns of newly-created tables. The -CREATE TABLE statement can override this default -by specifying the COMPRESSION column option. - -The supported compression methods are pglz and, -if PostgreSQL was compiled with ---with-lz4, lz4. +compression method for values of compressible columns. +(This can be overridden for individual columns by setting +the COMPRESSION column option in +CREATE TABLE or +ALTER TABLE.) +The supported compression methods are pglz and +(if PostgreSQL was compiled with +--with-lz4) lz4. The default is pglz. diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index 3a21129021..08b07f561e 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -26253,10 +26253,10 @@ postgres=# SELECT * FROM pg_walfile_name_offset(pg_stop_backup()); pg_column_compression pg_column_compression ( "any" ) -integer +text -Shows the compression algorithm that was used to compress a +Shows the compression algorithm that was used to compress an individual variable-length value. Returns NULL if the value is not compressed. diff --git a/doc/src/sgml/ref/alter_table.sgml b/doc/src/sgml/ref/alter_table.sgml index 1431d2649b..939d3fe273 100644 --- a/doc/src/sgml/ref/alter_table.sgml +++ b/doc/src/sgml/ref/alter_table.sgml @@ -104,7 +104,6 @@ WITH ( MODULUS numeric_literal, REM GENERATED { ALWAYS | BY DEFAULT } AS IDENTITY [ ( sequence_options ) ] | UNIQUE index_parameters | PRIMARY KEY index_parameters | - COMPRESSION compression_method | REFERENCES reftable [ ( refcolumn ) ] [ MATCH FULL | MATCH PARTIAL | MATCH SIMPLE ] [ ON DELETE referential_action ] [ ON UPDATE referential_action ] } [ DEFERRABLE | NOT DEFERRABLE ] [ INITIALLY DEFERRED | INITIALLY IMMEDIATE ] @@ -391,24 +390,27 @@ WITH ( MODULUS numeric_literal, REM - This sets the compression method to be used for data inserted into a column. - + This form sets the compression method for a column, determining how + values inserted in future will be compressed (if the storage mode + permits compression at all). This does not cause the table to be rewritten, so existing data may still be compressed with other compression methods. If the table is rewritten with VACUUM FULL or CLUSTER, or restored - with pg_restore, then all tuples are rewritten - with the configured compression methods. 
- - Also, note that when data is inserted from another relation (for example, - by INSERT ... SELECT), tuples from the source data are - not necessarily detoasted, and any previously compressed data is retained - with its existing compression method, rather than recompressing with the - compression methods of the target columns. - + with pg_restore, then all values are rewritten + with the configured compression method. + However, when data is inserted from another relation (for example, + by INSERT ... SELECT), values from the source table are + not necessarily detoasted, so any previously compressed data may retain + its existing compression method, rather than being recompressed with the + compression method of the target column. The supported compression methods are pglz and lz4. - lz4 is available only if --with-lz4 - was used when building PostgreSQL. + (lz4 i
Re: Commitfest app vs. pgsql-docs
On Mon, May 24, 2021 at 7:21 PM Julien Rouhaud wrote: > > On Mon, May 24, 2021 at 11:03 PM Andrew Dunstan wrote: > > > > On 5/24/21 10:55 AM, Magnus Hagander wrote: > > > On Mon, May 24, 2021 at 4:18 PM Andrew Dunstan > > > wrote: > > >> > > >> On 5/24/21 8:42 AM, Daniel Gustafsson wrote: > > On 24 May 2021, at 11:47, Magnus Hagander wrote: > > > > On Wed, May 19, 2021 at 11:08 PM Alvaro Herrera > > wrote: > > > On 2021-May-19, Andrew Dunstan wrote: > > > > > >> It's just a reference after all. So someone supplies a reference to > > >> an > > >> email on an out of the way list. What's the evil that will occur? Not > > >> much really AFAICT. > > Well, if you include all lists, the ability for you to findi things by > > the "most recent posts" or by searching for anything other than a > > unique message id will likely become less useful. > > >>> Thats a good case for restricting this to the smaller set of lists > > >>> which will > > >>> cover most submissions anyways. With a smaller set we could make the > > >>> UX still > > >>> work without presenting an incredibly long list. > > >>> > > >>> Or, the most recent emails dropdown only cover a set of common lists but > > >>> a search will scan all lists? > > >>> > > As long as you only ever search by message-id it won't make a > > difference. > > >>> Without any supporting evidence to back it up, my gut feeling tells me > > >>> the most > > >>> recent mails list is a good/simple way to lower the bar for submissions. > > >>> > > >> Maybe. I only ever do this by using an exact message-id, since that's > > >> what the web form specifically asks for :-) > > > The webform lets you either do a free text search, or pick from a > > > list, or enter a message-id, no? > > > > > > > > Yes it does, but the text next to the field says "Specify thread msgid:". > > Yes, I've always been confused by that form. I may have tried to > enter some free text once but AFAIR I always use the specific > message-id. This is clearly in need of a better UX. Any suggestions on how would be welcome. Would be enough to just say "Or specify... "? -- Magnus Hagander Me: https://www.hagander.net/ Work: https://www.redpill-linpro.com/
Re: storing an explicit nonce
On Wed, May 26, 2021 at 01:56:38PM -0400, Robert Haas wrote: > However, I believe that if we store the nonce in the page explicitly, > as proposed here, rather trying to derive it from the LSN, then we > don't need to worry about this kind of masking, which I think is > better from both a security perspective and a performance perspective. You are saying that by using a non-LSN nonce, you can write out the page with a new nonce, but the same LSN, and also discard the page during crash recovery and use the WAL copy? I am confused why checksums, which are widely used, acceptably require wal_log_hints, but there is concern that file encryption, which is heavier, cannot acceptably require wal_log_hints. I must be missing something. Why can't checksums also throw away hint bit changes like you want to do for file encryption and not require wal_log_hints? -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Re: storing an explicit nonce
Greetings, * Robert Haas (robertmh...@gmail.com) wrote: > On Wed, May 26, 2021 at 2:37 PM Stephen Frost wrote: > > > Anybody got a better idea? > > > > If we stipulate (and document) that all replicas need their own keys > > then we no longer need to worry about nonce re-use between the primary > > and the replica. Not sure that's *better*, per se, but I do think it's > > worth consideration. Teaching pg_basebackup how to decrypt and then > > re-encrypt with a different key wouldn't be challenging. > > I agree that we could do that and that it's possible worth > considering. However, it would be easy - and tempting - for users to > violate the no-nonce-reuse principle. For example, consider a (guessing you meant no-key-reuse above) > hypothetical user who takes a backup on Monday via a filesystem > snapshot - which might be either (a) a snapshot of the cluster while > it is stopped, or (b) a snapshot of the cluster while it's running, > from which crash recovery can be safely performed as long as it's a > true atomic snapshot, or (c) a snapshot taken between pg_start_backup > and pg_stop_backup which will be used just like a backup taken by > pg_basebackup. In any of these cases, there's no opportunity for a > tool we provide to intervene and re-key. Now, we would provide a tool > that re-keys in such situations and tell people to be sure they run it > before using any of those backups, and maybe that's the best we can > do. However, that tool is going to run for a good long time because it > has to rewrite the entire cluster, so someone with a terabyte-scale > database is going to be sorely tempted to skip this "unnecessary" and > time-consuming step. If it were possible to set things up so that good > things happen automatically and without user action, that would be > swell. Yes, if someone were to use a snapshot and set up a replica from it they'd end up with the same key being used and potentially have an issue with the key+nonce combination being re-used between the primary and replica with different data leading to a possible data leak. > Here's another idea: suppose that a nonce is 128 bits, 64 of which are > randomly generated at server startup, and the other 64 of which are a > counter. If you're willing to assume that the 64 bits generated > randomly at server startup are not going to collide in practice, > because the number of server lifetimes per key should be very small > compared to 2^64, then this gets you the benefits of a > randomly-generate nonce without needing to keep on generating new > cryptographically strong random numbers, and pretty much regardless of > what users do with their backups. If you replay an FPI, you can write > out the page exactly as you got it from the master, without > re-encrypting. If you modify and then write a page, you generate a > nonce for it containing your own server lifetime identifier. Yes, this kind of approach is discussed in the NIST publication in section 8.2.2. We'd have to keep track of what nonce we used for which page, of course, but that should be alright using the special space as discussed. > > Yes, if the amount of space available is variable then there's an added > > cost for that. While I appreciate the concern about having that be > > expensive, for my 2c at least, I like to think that having this sudden > > space that's available for use may lead to other really interesting > > capabilities beyond the ones we're talking about here, so I'm not really > > thrilled with the idea of boiling it down to just two cases. 
> > Although I'm glad you like some things about this idea, I think the > proposed system will collapse if we press it too hard. We're going to > need to be judicious. Sure. > > One thing to be absolutely clear about here though is that simply taking > > a hash() of the ciphertext and storing that with the data does *not* > > provide cryptographic data integrity validation for the page because it > > doesn't involve the actual key or IV at all and the hash is done after > > the ciphertext is generated- therefore an attacker can change the data > > and just change the hash to match and you'd never know. > > Ah, right. So you'd actually want something more like > hash(dboid||tsoid||relfilenode||blockno||block_contents||secret). > Maybe not generated exactly that way: perhaps the secret is really the > IV for the hash function rather than part of the hashed data, or > whatever. However you do it exactly, it prevents someone from > verifying - or faking - a signature unless they have the secret. > > > very hard for the attacker to discover) and suddently you're doing what > > AES-GCM *already* does for you, except you're trying to hack it yourself > > instead of using the tools available which were written by experts. > > I am all in favor of using the expert-written tools provided we can > figure out how to do it in a way we all agree is correct. In the patch set that Bruce has which uses the OpenSSL functions
Re: storing an explicit nonce
On Wed, May 26, 2021 at 03:49:43PM -0400, Stephen Frost wrote: > Greetings, > > * Bruce Momjian (br...@momjian.us) wrote: > > OK, that's what I thought. We already expose the clog and fsm, so > > exposing the hint bits seems acceptable. If everyone agrees, I will > > adjust my patch to not WAL log hint bit changes. > > Robert pointed out that it's not just hint bits where this is happening > though, but it can also happen with btree line pointer arrays. Even if > we were entirely comfortable accepting that the hint bits are leaked > because of this, leaking the btree line pointer array doesn't seem like > it could possibly be acceptable.. > > I've not run down that code myself, but I don't have any reason to doubt > Robert's assessment. OK, I guess we could split out log_hints to maybe just FPW-log btree changes or something, but my recent email questions why wal_log_hints is an issue anyway. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Re: storing an explicit nonce
On Wed, May 26, 2021 at 04:40:48PM -0400, Bruce Momjian wrote: > On Wed, May 26, 2021 at 01:56:38PM -0400, Robert Haas wrote: > > However, I believe that if we store the nonce in the page explicitly, > > as proposed here, rather trying to derive it from the LSN, then we > > don't need to worry about this kind of masking, which I think is > > better from both a security perspective and a performance perspective. > > You are saying that by using a non-LSN nonce, you can write out the page > with a new nonce, but the same LSN, and also discard the page during > crash recovery and use the WAL copy? > > I am confused why checksums, which are widely used, acceptably require > wal_log_hints, but there is concern that file encryption, which is > heavier, cannot acceptably require wal_log_hints. I must be missing > something. > > Why can't checksums also throw away hint bit changes like you want to do > for file encryption and not require wal_log_hints? One detail might be this extra hint bit FPW case: https://github.com/postgres/postgres/compare/bmomjian:cfe-01-doc..bmomjian:_cfe-02-internaldoc.patch However, if a hint-bit-modified page is written to the file system during a checkpoint, and there is a later hint bit change switching the same page from clean to dirty during the same checkpoint, we need a new LSN, and wal_log_hints doesn't give us a new LSN here. The fix for this is to update the page LSN by writing a dummy WAL record via xloginsert.c::LSNForEncryption() in such cases. Is this how file encryption is different from checksum wal_log_hints, and the big concern? -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Re: Add PortalDrop in exec_execute_message
On 2021-May-25, Yura Sokolov wrote: > Tom Lane писал 2021-05-21 21:23: > > Yura Sokolov writes: > > > I propose to add PortalDrop at the 'if (completed)' branch of > > > exec_execute_message. > > > > This violates our wire protocol specification, which > > specifically says > > > > If successfully created, a named portal object lasts till the end of > > the current transaction, unless explicitly destroyed. An unnamed > > portal is destroyed at the end of the transaction, or as soon as the > > next Bind statement specifying the unnamed portal as destination is > > issued. (Note that a simple Query message also destroys the unnamed > > portal.) > > > > I'm inclined to think that your complaint would be better handled > > by having the client send a portal-close command, if it's not > > going to do something else immediately. > > I thought about it as well. Then, if I understand correctly, > PQsendQueryGuts and PQsendQueryInternal in pipeline mode should send > "close portal" (CP) message after "execute" message, right? I don't think they should do that. The portal remains open, and the libpq interface does that. The portal gets closed at end of transaction without the need for any client message. I think if the client wanted to close the portal ahead of time, it would need a new libpq entry point to send the message to do that. (I didn't add a Close Portal message to PQsendQueryInternal in pipeline mode precisely because there is no such message in PQsendQueryGuts. I think it would be wrong to unconditionally add a Close Portal message to any of those places.) -- Álvaro Herrera39°49'30"S 73°17'W
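For readers unfamiliar with the protocol detail being discussed: the Close message a client would have to send to destroy a portal early is tiny at the wire level. The sketch below only illustrates the documented message layout (Byte1('C'), Int32 length, 'P' for portal, then the portal name); the helper name is made up, and this is not an existing libpq entry point.

    #include <stdint.h>
    #include <string.h>
    #include <arpa/inet.h>      /* htonl */

    /*
     * Write a frontend Close-portal message into buf and return its size.
     * The length field counts itself plus the body, per the protocol docs.
     * A Flush or Sync message still has to follow before the backend replies.
     */
    static size_t
    build_close_portal_msg(char *buf, const char *portal)
    {
        size_t      namelen = strlen(portal) + 1;   /* include terminating NUL */
        uint32_t    len = htonl((uint32_t) (4 + 1 + namelen));

        buf[0] = 'C';                   /* Close */
        memcpy(buf + 1, &len, 4);
        buf[5] = 'P';                   /* 'P' = portal, 'S' = prepared statement */
        memcpy(buf + 6, portal, namelen);
        return 6 + namelen;
    }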
Re: storing an explicit nonce
On Wed, May 26, 2021 at 01:56:38PM -0400, Robert Haas wrote: > In the interest of not being viewed as too much of a naysayer, let me > first reiterate that I am generally in favor of TDE going forward and > am not looking to throw up unnecessary obstacles in the way of making > that happen. Rather than surprise anyone, I might as well just come out and say some things. First, I have always admitted this feature has limited usefulness. I think a non-LSN nonce adds a lot of code complexity, which adds a code and maintenance burden. It also prevents the creation of an encrypted replica from a non-encrypted primary using binary replication, which makes deployment harder. Take a feature of limited usefulness, add code complexity and deployment difficulty, and the feature becomes even less useful. For these reasons, if we decide to go in the direction of using a non-LSN nonce, I no longer plan to continue working on this feature. I would rather work on things that have a more positive impact. Maybe a non-LSN nonce is a better long-term plan, but there are too many unknowns and complexity for me to feel comfortable with it. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Re: Add PortalDrop in exec_execute_message
Alvaro Herrera writes: > (I didn't add a Close Portal message to PQsendQueryInternal in pipeline > mode precisely because there is no such message in PQsendQueryGuts. > I think it would be wrong to unconditionally add a Close Portal message > to any of those places.) Yeah, I'm not very comfortable with having libpq take it on itself to do that, either. Looking back at the original complaint, it seems like it'd be fair to wonder why we're still holding a page pin in a supposedly completed executor run. Maybe the right fix is somewhere in the executor scan logic. regards, tom lane
Speed up pg_checksums in cases where checksum already set
The attached patch makes an optimization to pg_checksums which prevents rewriting the block if the checksum is already what we expect. This can lead to much faster runs in cases where it is already set (e.g. enabled -> disabled -> enable, external helper process, interrupted runs, future parallel processes). There is also an effort to not sync the data directory if no changes were written. Finally, added a bit more output on how many files were actually changed, e.g.: Checksum operation completed Files scanned: 1236 Blocks scanned: 23283 Files modified: 38 Blocks modified: 19194 pg_checksums: syncing data directory pg_checksums: updating control file Checksums enabled in cluster Cheers, Greg pg_checksums.optimize.writes.patch Description: Binary data
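The heart of the optimization is a single comparison in the per-block loop of pg_checksums; roughly the following, with variable names approximated rather than copied from the posted patch.

    /* Inside scan_file()'s per-block loop; header points into the block buffer. */
    csum = pg_checksum_page(buf.data, blockno + segmentno * RELSEG_SIZE);
    if (mode == PG_MODE_ENABLE)
    {
        if (csum == header->pd_checksum)
            continue;               /* checksum already correct: skip the write */

        header->pd_checksum = csum;
        /* ... seek back and write the block out, as the existing code does ... */
        blocks_modified++;          /* feeds the new "Blocks modified" report */
    }

Since a block whose checksum already matches is left untouched, a run over an already-checksummed cluster becomes read-mostly, and skipping the final sync of the data directory when nothing was written follows naturally.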
Re: Add ZSON extension to /contrib/
On Tue, May 25, 2021 at 01:55:13PM +0300, Aleksander Alekseev wrote: > Hi hackers, > > Back in 2016 while being at PostgresPro I developed the ZSON extension [1]. > The > extension introduces the new ZSON type, which is 100% compatible with JSONB > but > uses a shared dictionary of strings most frequently used in given JSONB > documents for compression. These strings are replaced with integer IDs. > Afterward, PGLZ (and now LZ4) applies if the document is large enough by > common > PostgreSQL logic. Under certain conditions (many large documents), this saves > disk space, memory and increases the overall performance. More details can be > found in README on GitHub. I think this is interesting because it is one of the few cases that allow compression outside of a single column. Here is a list of compression options: https://momjian.us/main/blogs/pgblog/2020.html#April_27_2020 1. single field 2. across rows in a single page 3. across rows in a single column 4. across all columns and rows in a table 5. across tables in a database 6. across databases While standard Postgres does #1, ZSON allows 2-5, assuming the data is in the ZSON data type. I think this cross-field compression has great potential for cases where the data is not relational, or hasn't had time to be structured relationally. It also opens questions of how to do this cleanly in a relational system. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Re: Incorrect usage of strtol, atoi for non-numeric junk inputs
On 2021-May-19, Bharath Rupireddy wrote: > While working on [1], I found that some parts of the code is using > strtol and atoi without checking for non-numeric junk input strings. I > found this strange. Most of the time users provide proper numeric > strings but there can be some scenarios where these strings are not > user-supplied but generated by some other code which may contain > non-numeric strings. Shouldn't the code use strtol or atoi > appropriately and error out in such cases? One way to fix this once > and for all is to have a common API something like int > pg_strtol/pg_str_convert_to_int(char *opt_name, char *opt_value) which > returns a generic message upon invalid strings ("invalid value \"%s\" > is provided for option \"%s\", opt_name, opt_value) and returns > integers on successful parsing. Hi, how is this related to https://postgr.es/m/20191028012000.ga59...@begriffs.com ? -- Álvaro Herrera Valdivia, Chile
Reducing the range of OIDs consumed by genbki.pl
The attached patch stems from the conversation at [1]; I'm starting a new thread to avoid confusing the cfbot. Briefly, the idea is to allow reverting the change made in commit ab596105b to increase FirstBootstrapObjectId from 12000 to 13000, by teaching genbki.pl to assign OIDs independently in each catalog rather than from a single OID counter. Thus, the OIDs in this range will not be globally unique anymore, but only unique per-catalog. The aforesaid commit had to increase FirstBootstrapObjectId because as of HEAD, genbki.pl needs to consume OIDs up through 12035, overrunning the old limit of 12000. But moving up that limit seems a bit risky, cf [2]. It'd be better if we could avoid doing that. Since the OIDs in question are spread across several catalogs, allocating them per-catalog seems to fix the problem quite effectively. With the attached patch, the ending OID counters are GenbkiNextOid(pg_amop) = 10945 GenbkiNextOid(pg_amproc) = 10697 GenbkiNextOid(pg_cast) = 10230 GenbkiNextOid(pg_opclass) = 10164 so we have quite a lot of daylight before we'll ever approach 12000 again. Per-catalog OID uniqueness shouldn't be a problem here, because any code that's assuming global uniqueness is broken anyway; no such guarantee exists after the OID counter has wrapped around. So I propose shoehorning this into v14, to avoid shipping a release where FirstBootstrapObjectId has been bumped up. regards, tom lane [1] https://www.postgresql.org/message-id/flat/3737988.1618451008%40sss.pgh.pa.us [2] https://www.postgresql.org/message-id/flat/CAGPqQf3JYTrTB1E1fu_zOGj%2BrG_kwTfa3UcUYPfNZL9o1bcYNw%40mail.gmail.com diff --git a/doc/src/sgml/bki.sgml b/doc/src/sgml/bki.sgml index b33e59d5e4..db1b3d5e9a 100644 --- a/doc/src/sgml/bki.sgml +++ b/doc/src/sgml/bki.sgml @@ -418,11 +418,11 @@ If genbki.pl needs to assign an OID to a catalog entry that does not have a manually-assigned OID, it will use a value in -the range 1—12999. The server's OID counter is set to 13000 +the range 1—11999. The server's OID counter is set to 12000 at the start of a bootstrap run. Thus objects created by regular SQL commands during the later phases of bootstrap, such as objects created while running the information_schema.sql script, -receive OIDs of 13000 or above. +receive OIDs of 12000 or above. diff --git a/src/backend/catalog/genbki.pl b/src/backend/catalog/genbki.pl index 112b3affdf..e5e774267e 100644 --- a/src/backend/catalog/genbki.pl +++ b/src/backend/catalog/genbki.pl @@ -167,15 +167,17 @@ foreach my $oid (keys %oidcounts) die "found $found duplicate OID(s) in catalog data\n" if $found; -# Oids not specified in the input files are automatically assigned, +# OIDs not specified in the input files are automatically assigned, # starting at FirstGenbkiObjectId, extending up to FirstBootstrapObjectId. +# We allow such OIDs to be assigned independently within each catalog. my $FirstGenbkiObjectId = Catalog::FindDefinedSymbol('access/transam.h', $include_path, 'FirstGenbkiObjectId'); my $FirstBootstrapObjectId = Catalog::FindDefinedSymbol('access/transam.h', $include_path, 'FirstBootstrapObjectId'); -my $GenbkiNextOid = $FirstGenbkiObjectId; +# Hash of next available OID, indexed by catalog name. +my %GenbkiNextOids; # Fetch some special data that we will substitute into the output file. 
@@ -563,8 +565,7 @@ EOM # Assign oid if oid column exists and no explicit assignment in row if ($attname eq "oid" and not defined $bki_values{$attname}) { -$bki_values{$attname} = $GenbkiNextOid; -$GenbkiNextOid++; +$bki_values{$attname} = assign_next_oid($catname); } # Replace OID synonyms with OIDs per the appropriate lookup rule. @@ -669,11 +670,6 @@ foreach my $declaration (@index_decls) # last command in the BKI file: build the indexes declared above print $bki "build indices\n"; -# check that we didn't overrun available OIDs -die - "genbki OID counter reached $GenbkiNextOid, overrunning FirstBootstrapObjectId\n" - if $GenbkiNextOid > $FirstBootstrapObjectId; - # Now generate system_constraints.sql foreach my $c (@system_constraints) @@ -1081,6 +1077,25 @@ sub form_pg_type_symbol return $name . $arraystr . 'OID'; } +# Assign an unused OID within the specified catalog. +sub assign_next_oid +{ + my $catname = shift; + + # Initialize, if no previous request for this catalog. + $GenbkiNextOids{$catname} = $FirstGenbkiObjectId + if !defined($GenbkiNextOids{$catname}); + + my $result = $GenbkiNextOids{$catname}++; + + # Check that we didn't overrun available OIDs + die + "genbki OID counter for $catname reached $result, overrunning FirstBootstrapObjectId\n" + if $result >= $FirstBootstrapObjectId; + + return $result; +} + sub usage { die <
Re: Move pg_attribute.attcompression to earlier in struct for reduced size?
On 2021-May-26, Tom Lane wrote: > I wrote: > > I think this is about ready to commit now (though I didn't yet nuke > > GetDefaultToastCompression). > > Here's a bundled-up final version, in case anybody would prefer > to review it that way. Looks good to me. I tested the behavior with partitioned tables and it seems OK. It would be good to have a test case in src/bin/pg_dump/t/002_pg_dump.pl for the case ... and I find it odd that we don't seem to have anything for the "CREATE TABLE foo (LIKE sometab INCLUDING stuff)" form of the command ... but neither of those seem the fault of this patch, and they both work as [I think] is intended. -- Álvaro Herrera Valdivia, Chile "Estoy de acuerdo contigo en que la verdad absoluta no existe... El problema es que la mentira sí existe y tu estás mintiendo" (G. Lama)
Re: Move pg_attribute.attcompression to earlier in struct for reduced size?
Alvaro Herrera writes: > Looks good to me. > I tested the behavior with partitioned tables and it seems OK. Thanks for reviewing/testing! > It would be good to have a test case in src/bin/pg_dump/t/002_pg_dump.pl > for the case Personally I won't touch 002_pg_dump.pl with a 10-foot pole, but if somebody else wants to, have at it. > ... and I find it odd that we don't seem to have anything > for the "CREATE TABLE foo (LIKE sometab INCLUDING stuff)" form of the > command ... but neither of those seem the fault of this patch, and they > both work as [I think] is intended. Hm, there's this in compression.sql: -- test LIKE INCLUDING COMPRESSION CREATE TABLE cmdata2 (LIKE cmdata1 INCLUDING COMPRESSION); \d+ cmdata2 Or did you mean the case with a partitioned table specifically? regards, tom lane
Re: storing an explicit nonce
Hi, On 2021-05-26 07:14:47 +0200, Antonin Houska wrote: > Bruce Momjian wrote: > > On Tue, May 25, 2021 at 04:48:21PM -0700, Andres Freund wrote: > > My point was about whether we need to change the nonce, and hence > > WAL-log full page images if we change hint bits. If we don't and > > reencrypt the page with the same nonce, don't we only expose the hint > > bits? I was not suggesting we avoid changing the nonce in non-hint-bit > > cases. > > > > I don't understand your computation above. You decrypt the page into > > shared buffers, you change a hint bit, and rewrite the page. You are > > re-XOR'ing the buffer copy with the same key and nonce. Doesn't that > > only change the hint bits in the new write? Yea, I had a bit of a misfire there. Sorry. I suspect that if we try to not disclose data if an attacker has write access, this still leaves us with issues around nonce reuse, unless we also employ integrity measures. Particularly due to CTR mode, which makes it easy to manipulate individual parts of the encrypted page without causing the decrypted page to be invalid. E.g. the attacker can just update pd_upper on the page by a small offset, and suddenly the replay will insert the tuple at a slightly shifted offset - which then seems to leak enough data to actually analyze things? As the patch stands that seems trivially doable, because as I read it, most of the page header is not encrypted, and checksums are done of the already encrypted data. But even if that weren't the case, brute forcing 16bit worth of checksum isn't too bad, even though it would obviously make an attack a lot more noisy. https://github.com/bmomjian/postgres/commit/7b43d37a5edb91c29ab6b4bb00def05def502c33#diff-0dcb5b2f36c573e2a7787994690b8fe585001591105f78e58ae3accec8f998e0R92 /* * Check if the page has a special size == GISTPageOpaqueData, a valid * GIST_PAGE_ID, no invalid GiST flag bits are set, and a valid LSN. This * is true for all GiST pages, and perhaps a few pages that are not. The * only downside of guessing wrong is that we might not update the LSN for * some non-permanent relation page changes, and therefore reuse the IV, * which seems acceptable. */ Huh? Regards, Andres
Re: Move pg_attribute.attcompression to earlier in struct for reduced size?
On 2021-May-26, Tom Lane wrote: > Alvaro Herrera writes: > > It would be good to have a test case in src/bin/pg_dump/t/002_pg_dump.pl > > for the case > > Personally I won't touch 002_pg_dump.pl with a 10-foot pole, but if > somebody else wants to, have at it. Nod. > > ... and I find it odd that we don't seem to have anything > > for the "CREATE TABLE foo (LIKE sometab INCLUDING stuff)" form of the > > command ... but neither of those seem the fault of this patch, and they > > both work as [I think] is intended. > > Hm, there's this in compression.sql: > > -- test LIKE INCLUDING COMPRESSION > CREATE TABLE cmdata2 (LIKE cmdata1 INCLUDING COMPRESSION); > \d+ cmdata2 > > Or did you mean the case with a partitioned table specifically? Ah, I guess that's sufficient. (The INCLUDING clause cannot be used to create a partition, actually.) -- Álvaro Herrera39°49'30"S 73°17'W "Now I have my system running, not a byte was off the shelf; It rarely breaks and when it does I fix the code myself. It's stable, clean and elegant, and lightning fast as well, And it doesn't cost a nickel, so Bill Gates can go to hell."
Re: storing an explicit nonce
Hi, On 2021-05-25 22:23:46 -0400, Stephen Frost wrote: > Andres mentioned other possible cases where the LSN doesn’t change even > though we change the page and, as he’s probably right, we would have to > figure out a solution in those cases too (potentially including cases like > crash recovery or replay on a replica where we can’t really just go around > creating dummy WAL records to get new LSNs..). Yea, I think there's quite a few of those. For one, we don't guarantee that the hole between pd_lower/upper is zeroes. It e.g. contains old tuple data after deleted tuples are pruned away. But when logging an FPI, we omit that range. Which means that after crash recovery the area is zeroed out. There are several cases where padding can result in the same. Just look at checkXLogConsistency(), heap_mask() et al for all the differences that can occur and that need to be ignored for the recovery consistency checking to work. Particularly the hole issue seems trivial to exploit, because we know the plaintext of the hole after crash recovery (0s). I don't see how using the LSN alone is salvageable. Greetings, Andres Freund
Re: storing an explicit nonce
Hi, On 2021-05-25 17:12:05 -0400, Bruce Momjian wrote: > If we used a block cipher instead of a streaming one (CTR), this might > not work because the earlier blocks can be based in the output of > later blocks. What made us choose CTR for WAL & data file encryption? I checked the README in the patchset and the wiki page, and neither seems to discuss that. The dangers around nonce reuse, the space overhead of storing the nonce, and the fact that single-bit changes in the encrypted data don't propagate seem not great? Why aren't we using something like XTS? It has obvious issues as well, but CTR's weaknesses seem at least as great. And if we want a MAC, then we don't want CTR either. Greetings, Andres Freund
Re: Move pg_attribute.attcompression to earlier in struct for reduced size?
On Wed, May 26, 2021 at 07:44:03PM -0400, Alvaro Herrera wrote: > On 2021-May-26, Tom Lane wrote: >> Personally I won't touch 002_pg_dump.pl with a 10-foot pole, but if >> somebody else wants to, have at it. > > Nod. Yeah, having an extra test for partitioned tables would be a good idea. >> Hm, there's this in compression.sql: >> >> -- test LIKE INCLUDING COMPRESSION >> CREATE TABLE cmdata2 (LIKE cmdata1 INCLUDING COMPRESSION); >> \d+ cmdata2 >> >> Or did you mean the case with a partitioned table specifically? > > Ah, I guess that's sufficient. (The INCLUDING clause cannot be used to > create a partition, actually.) +column_compression: + COMPRESSION ColId { $$ = $2; } + | COMPRESSION DEFAULT { $$ = pstrdup("default"); } Could it be possible to have some tests for COMPRESSION DEFAULT? It seems to me that this should be documented as a supported keyword for CREATE/ALTER TABLE. --changing column storage should not impact the compression method --but the data should not be compressed ALTER TABLE cmdata2 ALTER COLUMN f1 TYPE varchar; +ALTER TABLE cmdata2 ALTER COLUMN f1 SET COMPRESSION pglz; This comment needs a refresh? -- Michael signature.asc Description: PGP signature
Re: Move pg_attribute.attcompression to earlier in struct for reduced size?
Michael Paquier writes: > Yeah, having an extra test for partitioned tables would be a good > idea. We do have some coverage already via the pg_upgrade test. > Could it be possible to have some tests for COMPRESSION DEFAULT? It > seems to me that this should be documented as a supported keyword for > CREATE/ALTER TABLE. Uh, I did do both of those, no? (The docs treat "default" as another possible value, not a keyword, even though it's a keyword internally.) > --changing column storage should not impact the compression method > --but the data should not be compressed > ALTER TABLE cmdata2 ALTER COLUMN f1 TYPE varchar; > +ALTER TABLE cmdata2 ALTER COLUMN f1 SET COMPRESSION pglz; > This comment needs a refresh? It's correct AFAICS. Maybe it needs a bit of editing for clarity, but I'm not sure how to make it better. The point is that the SET STORAGE just below disables compression of new values, no matter what SET COMPRESSION says. regards, tom lane
Re: Move pg_attribute.attcompression to earlier in struct for reduced size?
Hi, On 2021-05-25 14:46:27 +0900, Michael Paquier wrote: > That would work. Your suggestion, as I understood it first, makes the > code simpler by not using tup_values at all as the set of values[] is > filled when the values and nulls are extracted. So I have gone with > this simplification, and applied the patch (moved a bit the comments > while on it). Hm. memsetting values_free() to zero repeatedly isn't quite free, nor is iterating over all columns one more time. Note that values/isnull are passed in, and allocated with an accurate size, so it's a bit odd to then do a pessimally sized stack allocation. Efficiency aside, that just seems a bit weird? The efficiency bit is probably going to be swamped by the addition of the compression handling, given the amount of additional work we're now doing in reform_and_rewrite_tuple(). I wonder if we should check how much slower a VACUUM FULL of a table with a few varlena columns has gotten vs 13. Greetings, Andres Freund
Re: Move pg_attribute.attcompression to earlier in struct for reduced size?
Andres Freund writes: > The efficiency bit is probably going to be swamped by the addition of > the compression handling, given the amount of additional work we're now > doing in in reform_and_rewrite_tuple(). Only if the user has explicitly requested a change of compression, no? regards, tom lane
Re: Race condition in recovery?
At Wed, 26 May 2021 22:08:32 +0530, Dilip Kumar wrote in > On Wed, 26 May 2021 at 10:06 PM, Robert Haas wrote: > > > On Wed, May 26, 2021 at 12:26 PM Dilip Kumar > > wrote: > > > I will check if there is any timing dependency in the test case. > > > > There is. I explained it in the second part of my email, which you may > > have failed to notice. > > > Sorry, my bad. I got your point now. I will change the test. I didn't notice that, but it is indeed possible to happen. By the way, I'm having a hard time understanding what has been happening on this thread. Very early in this thread I posted a test script that exactly reproduces Dilip's case by starting from two standbys, based on his explanation. But *we* didn't understand what the original commit ee994272ca intended, and I understood that we wanted to know it. So in the mails [1] and [2] I tried to describe what's going on around the two issues. Although I haven't had a response to [2], can I assume that we have clarified the intention of ee994272ca? And may I assume that we decided not to add a test for that commit? Then it seems to me that Robert rediscovered how to reproduce Dilip's case using basebackup instead of using two standbys. It uses a broken archive_command with pg_basebackup -Xnone, and I showed that the same resulting state can be reached by pg_basebackup -Xstream/fetch and then clearing the pg_wal directory of the resulting backup, including an explanation of why. *I* think it is better to avoid the archive_command, since just unlinking some files seems simpler than having the broken archive_command. However, since Robert ignored it, I guess that Robert thinks that the broken archive_command is better than that. Is my understanding above of the current status of this thread right? FWIW, regarding the name of the test script, putting aside what it actually does, I proposed to place it as a part of 004_timeline_switch.pl because this issue is related to timeline switching. [1] 20210521.112105.27166595366072396.horikyota@gmail.com [2] https://www.postgresql.org/message-id/20210524.113402.1922481024406047229.horikyota@gmail.com regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Move pg_attribute.attcompression to earlier in struct for reduced size?
On Wed, May 26, 2021 at 08:35:46PM -0400, Tom Lane wrote: > Andres Freund writes: > > The efficiency bit is probably going to be swamped by the addition of > > the compression handling, given the amount of additional work we're now > > doing in in reform_and_rewrite_tuple(). > > Only if the user has explicitly requested a change of compression, no? Andres' point is that we'd still initialize and run through values_free at the end of reform_and_rewrite_tuple() for each tuple even if there is no need to do so. Well, we could control the initialization and the free() checks at the end of the routine if we know that there has been at least one detoasted value, at the expense of making the code a bit less clear, of course. -- Michael signature.asc Description: PGP signature
RE: [HACKERS] logical decoding of two-phase transactions
On Wed, May 26, 2021 10:13 PM Ajin Cherian wrote: > > I've attached a patch that fixes this issue. Do test and confirm. > Thanks for your patch. I have tested and confirmed that the issue I reported has been fixed. Regards Tang
Re: Speed up pg_checksums in cases where checksum already set
On Thu, May 27, 2021 at 5:24 AM Greg Sabino Mullane wrote: > > The attached patch makes an optimization to pg_checksums which prevents > rewriting the block if the checksum is already what we expect. This can lead > to much faster runs in cases where it is already set (e.g. enabled -> > disabled -> enable, external helper process, interrupted runs, future > parallel processes). There is also an effort to not sync the data directory > if no changes were written. Finally, added a bit more output on how many > files were actually changed, e.g.: I don't know how often this will actually help as probably people aren't toggling the checksum state that often, but it seems like a good idea overall. The patch looks sensible to me.
RE: Parallel Inserts in CREATE TABLE AS
From: Bharath Rupireddy Sent: Wednesday, May 26, 2021 7:22 PM > Thanks for trying that out. > > Please see the code around the use_fsm flag in RelationGetBufferForTuple for > more understanding of the points below. > > What happens if FSM is skipped i.e. myState->ti_options = > TABLE_INSERT_SKIP_FSM;? > 1) The flag use_fsm will be false in heap_insert->RelationGetBufferForTuple. > 2) Each worker initially gets a block and keeps inserting into it until it is > full. > When the block is full, the worker doesn't look in FSM GetPageWithFreeSpace > as use_fsm is false. It directly goes for relation extension and tries to > acquire > relation extension lock with LockRelationForExtension. Note that the bulk > extension of blocks with RelationAddExtraBlocks is not reached as use_fsm is > false. > 3) After acquiring the relation extension lock, it adds an extra new block > with > ReadBufferBI(relation, P_NEW, ...), see the comment "In addition to whatever > extension we performed above, we always add at least one block to satisfy our > own request." The tuple is inserted into this new block. > > Basically, the workers can't look for the empty pages from the pages added by > other workers, they keep doing the above steps in silos. > > What happens if FSM is not skipped i.e. myState->ti_options = 0;? > 1) The flag use_fsm will be true in heap_insert->RelationGetBufferForTuple. > 2) Each worker initially gets a block and keeps inserting into it until it is > full. > When the block is full, the worker looks for the page with free space in FSM > GetPageWithFreeSpace as use_fsm is true. > If it can't find any page with the required amount of free space, it goes for > bulk > relation extension(RelationAddExtraBlocks) after acquiring relation extension > lock with ConditionalLockRelationForExtension. Then the worker adds > extraBlocks = Min(512, lockWaiters * 20); new blocks in > RelationAddExtraBlocks and immediately updates the bottom level of FSM for > each block (see the comment around RecordPageWithFreeSpace for why only > the bottom level, not the entire FSM tree). After all the blocks are added, > then > it updates the entire FSM tree FreeSpaceMapVacuumRange. > 4) After the bulk extension, then the worker adds another block see the > comment "In addition to whatever extension we performed above, we always > add at least one block to satisfy our own request." and inserts tuple into > this > new block. > > Basically, the workers can benefit from the bulk extension of the relation and > they always can look for the empty pages from the pages added by other > workers. There are high chances that the blocks will be available after bulk > extension. Having said that, if the added extra blocks are consumed by the > workers so fast i.e. if the tuple sizes are big i.e very less tuples per > page, then, > the bulk extension too can't help much and there will be more contention on > the relation extension lock. Well, one might think to add more blocks at a > time, > say Min(1024, lockWaiters * 128/256/512) than currently extraBlocks = Min(512, > lockWaiters * 20);. This will work (i.e. we don't see any regression with > parallel > inserts in CTAS patches), but it can't be a practical solution. Because the > total > pages for the relation will be more with many pages having more free space. > Furthermore, the future sequential scans on that relation might take a lot of > time. 
> > If myState->ti_options = TABLE_INSERT_SKIP_FSM; in only the place(within if > (myState->is_parallel)), then it will be effective for leader i.e. leader > will not > look for FSM, but all the workers will, because within if > (myState->is_parallel_worker) in intorel_startup, > myState->ti_options = 0; for workers. > > I ran tests with configuration shown at [1] for the case 4 (2 bigint(of 8 > bytes > each) columns, 16 name(of 64 bytes each) columns, tuple size 1064 bytes, 10mn > tuples) with leader participation where I'm seeing regression: > > 1) when myState->ti_options = TABLE_INSERT_SKIP_FSM; for both leader and > workers, then my results are as follows: > 0 workers - 116934.137, 2 workers - 209802.060, 4 workers - 248580.275 > 2) when myState->ti_options = 0; for both leader and workers, then my results > are as follows: > 0 workers - 1116184.718, 2 workers - 139798.055, 4 workers - 143022.409 > I hope the above explanation and the test results should clarify the fact that > skipping FSM doesn't solve the problem. Let me know if anything is not clear > or > I'm missing something. Thanks for the explanation. I followed your above test steps and the below configuration, but my test results are a little different from yours. I am not sure the exact reason, maybe because of the hardware.. Test INSERT 1000 rows((2 bigint(of 8 bytes) 16 name(of 64 bytes each) columns): SERIAL: 22023.631 ms PARALLEL 2 WORKER [NOT SKIP FSM]: 21824.934 ms [SKIP FSM]: 19381.474 ms PARALLEL 4 WORKER [NOT SKIP FSM]: 20481.117 ms [SKIP FSM]: 1
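For readers skimming the thread, the knob being compared above is just the ti_options bitmask handed to the table AM at insert time. A sketch of the variant Bharath describes, using the is_parallel_worker field that the posted CTAS patches define on the DestReceiver state:

    /* In intorel_startup(), roughly: decide per backend whether inserts may
     * reuse free space found via the FSM or must always extend the relation. */
    if (myState->is_parallel_worker)
        myState->ti_options = 0;                        /* workers consult the FSM */
    else
        myState->ti_options = TABLE_INSERT_SKIP_FSM;    /* leader / serial CTAS skips it */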
Re: Move pg_attribute.attcompression to earlier in struct for reduced size?
Hi, On 2021-05-26 20:35:46 -0400, Tom Lane wrote: > Andres Freund writes: > > The efficiency bit is probably going to be swamped by the addition of > > the compression handling, given the amount of additional work we're now > > doing in in reform_and_rewrite_tuple(). > > Only if the user has explicitly requested a change of compression, no? Oh, it'll definitely be more expensive in that case - but that seems fair game. What I was wondering about was whether VACUUM FULL would be measurably slower, because we'll now call toast_get_compression_id() on each varlena datum. It's pretty easy for VACUUM FULL to be CPU bound already, and presumably this'll add a bit. Greetings, Andres Freund
Re: Move pg_attribute.attcompression to earlier in struct for reduced size?
On Wed, May 26, 2021 at 06:54:15PM -0700, Andres Freund wrote: > Oh, it'll definitely be more expensive in that case - but that seems > fair game. What I was wondering about was whether VACUUM FULL would be > measurably slower, because we'll now call toast_get_compression_id() on > each varlena datum. It's pretty easy for VACUUM FULL to be CPU bound > already, and presumably this'll add a bit. This depends on the number of attributes, but I do see an extra 0.5% __memmove_avx_unaligned_erms in reform_and_rewrite_tuple() for a normal VACUUM FULL with a 1-int-column relation on a perf profile, with rewrite_heap_tuple eating most of it as in the past, so that's within the noise bandwidth if you measure the runtime. What would be the worst case here, a table with one text column made of non-NULL still very short values? -- Michael signature.asc Description: PGP signature
Re: Move pg_attribute.attcompression to earlier in struct for reduced size?
Hi, On 2021-05-26 18:54:15 -0700, Andres Freund wrote: > On 2021-05-26 20:35:46 -0400, Tom Lane wrote: > > Andres Freund writes: > > > The efficiency bit is probably going to be swamped by the addition of > > > the compression handling, given the amount of additional work we're now > > > doing in in reform_and_rewrite_tuple(). > > > > Only if the user has explicitly requested a change of compression, no? > > Oh, it'll definitely be more expensive in that case - but that seems > fair game. What I was wondering about was whether VACUUM FULL would be > measurably slower, because we'll now call toast_get_compression_id() on > each varlena datum. It's pretty easy for VACUUM FULL to be CPU bound > already, and presumably this'll add a bit. > CREATE UNLOGGED TABLE vacme_text(t01 text not null default 't01',t02 text not null default 't02',t03 text not null default 't03',t04 text not null default 't04',t05 text not null default 't05',t06 text not null default 't06',t07 text not null default 't07',t08 text not null default 't08',t09 text not null default 't09',t10 text not null default 't10'); CREATE UNLOGGED TABLE vacme_int(i1 int not null default '1',i2 int not null default '2',i3 int not null default '3',i4 int not null default '4',i5 int not null default '5',i6 int not null default '6',i7 int not null default '7',i8 int not null default '8',i9 int not null default '9',i10 int not null default '10'); INSERT INTO vacme_text SELECT FROM generate_series(1, 1000); INSERT INTO vacme_int SELECT FROM generate_series(1, 1000); I ran 10 VACUUM FULLs on each, chose the shortest time: unmodified text: 3562ms int:3037ms after ifdefing out the compression handling: text: 3175ms (x 0.88) int:2894ms (x 0.95) That's not *too* bad, but also not nothing Greetings, Andres Freund
Re: Add ZSON extension to /contrib/
On 5/26/21 5:29 PM, Bruce Momjian wrote: > On Tue, May 25, 2021 at 01:55:13PM +0300, Aleksander Alekseev wrote: >> Hi hackers, >> >> Back in 2016 while being at PostgresPro I developed the ZSON extension [1]. >> The >> extension introduces the new ZSON type, which is 100% compatible with JSONB >> but >> uses a shared dictionary of strings most frequently used in given JSONB >> documents for compression. These strings are replaced with integer IDs. >> Afterward, PGLZ (and now LZ4) applies if the document is large enough by >> common >> PostgreSQL logic. Under certain conditions (many large documents), this saves >> disk space, memory and increases the overall performance. More details can be >> found in README on GitHub. > I think this is interesting because it is one of the few cases that > allow compression outside of a single column. Here is a list of > compression options: > > https://momjian.us/main/blogs/pgblog/2020.html#April_27_2020 > > 1. single field > 2. across rows in a single page > 3. across rows in a single column > 4. across all columns and rows in a table > 5. across tables in a database > 6. across databases > > While standard Postgres does #1, ZSON allows 2-5, assuming the data is > in the ZSON data type. I think this cross-field compression has great > potential for cases where the data is not relational, or hasn't had time > to be structured relationally. It also opens questions of how to do > this cleanly in a relational system. > I think we're going to get the best bang for the buck on doing 2, 3, and 4. If it's confined to a single table then we can put a dictionary in something like a fork. Maybe given partitioning we want to be able to do multi-table dictionaries, but that's less certain. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Re: Speed up pg_checksums in cases where checksum already set
On Wed, May 26, 2021 at 05:23:55PM -0400, Greg Sabino Mullane wrote: > The attached patch makes an optimization to pg_checksums which prevents > rewriting the block if the checksum is already what we expect. This can > lead to much faster runs in cases where it is already set (e.g. enabled -> > disabled -> enable, external helper process, interrupted runs, future > parallel processes). Makes sense. > There is also an effort to not sync the data directory > if no changes were written. Finally, added a bit more output on how many > files were actually changed, e.g.: -if (do_sync) +if (do_sync && total_files_modified) { pg_log_info("syncing data directory"); fsync_pgdata(DataDir, PG_VERSION_NUM); Here, I am on the edge. It could be an advantage to force a flush of the data folder anyway, no? Say, all the pages have a correct checksum and they are in the OS cache, but they may not have been flushed yet. That would emulate what initdb -S does already. > Checksum operation completed > Files scanned: 1236 > Blocks scanned: 23283 > Files modified: 38 > Blocks modified: 19194 > pg_checksums: syncing data directory > pg_checksums: updating control file > Checksums enabled in cluster The addition of the number of files modified looks like an advantage. -- Michael signature.asc Description: PGP signature
RE: Parallel Inserts in CREATE TABLE AS
Thank you for the detailed analysis, I'll look into it too. (The times have changed...) From: Bharath Rupireddy > Well, one might think to add more blocks at a time, say > Min(1024, lockWaiters * 128/256/512) than currently extraBlocks = > Min(512, lockWaiters * 20);. This will work (i.e. we don't see any > regression with parallel inserts in CTAS patches), but it can't be a > practical solution. Because the total pages for the relation will be > more with many pages having more free space. Furthermore, the future > sequential scans on that relation might take a lot of time. > Otherwise, the similar speed up can be observed when the BAS_BULKWRITE > is increased a bit from the current 16MB to some other reasonable > value. I earlier tried these experiments. > > Otherwise, as I said in [1], we can also increase the number of extra > blocks added at a time, say Min(1024, lockWaiters * 128/256/512) than > currently extraBlocks = Min(512, lockWaiters * 20);. This will also > give some speedup and we don't see any regression with parallel > inserts in CTAS patches. > > But, I'm not so sure that the hackers will agree any of the above as a > practical solution to the "relation extension" problem. I think I understand your concern about resource consumption and impact on other concurrently running jobs (OLTP, data analysis). OTOH, what's the situation like when the user wants to run CTAS, and further, wants to speed it up by using parallelism? Isn't it okay to let the (parallel) CTAS use as much as it wants? At least, I think we can provide another mode for it, like Oracle provides conventional path mode and direct path mode for INSERT and data loading. What would we want to do to maximize parallel CTAS speedup if we were a bit unshackled from the current constraints (alignment with existing code, impact on other concurrent workloads)? * Use as many shared buffers as possible to decrease WAL flush. Otherwise, INSERT SELECT may be faster? * Minimize relation extension (= increase the block count per extension) posix_fallocate() would help too. * Allocate added pages among parallel workers, and each worker fills pages to their full capacity. The worker that extended the relation stores the page numbers of added pages in shared memory for parallel execution. Each worker gets a page from there after waiting for the relation extension lock, instead of using FSM. The last pages that the workers used will be filled halfway, but the amount of unused space should be low compared to the total table size. Regards Takayuki Tsunakawa
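On the posix_fallocate() point: the idea is to reserve many blocks in one filesystem call instead of writing zero-filled pages one at a time. A minimal sketch in plain POSIX terms (not PostgreSQL's md.c API; the function below is hypothetical):

    #include <fcntl.h>
    #include <unistd.h>

    #define MY_BLCKSZ 8192      /* stand-in for PostgreSQL's BLCKSZ */

    /*
     * Reserve nblocks additional blocks at the end of an already-open segment
     * file.  Returns 0 on success; otherwise -1 (lseek failure) or an errno
     * value from posix_fallocate().  A caller could fall back to extending
     * with zero-filled writes, which is effectively what mdextend() does today.
     */
    static int
    bulk_extend_segment(int fd, unsigned nblocks)
    {
        off_t   cur_size = lseek(fd, 0, SEEK_END);

        if (cur_size < 0)
            return -1;
        return posix_fallocate(fd, cur_size, (off_t) nblocks * MY_BLCKSZ);
    }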
Re: Move pg_attribute.attcompression to earlier in struct for reduced size?
Hi, On 2021-05-27 11:07:53 +0900, Michael Paquier wrote: > This depends on the number of attributes, but I do see an extra 0.5% > __memmove_avx_unaligned_erms in reform_and_rewrite_tuple() for a > normal VACUUM FULL with a 1-int-column relation on a perf profile, > with rewrite_heap_tuple eating most of it as in the past, so that's > within the noise bandwidth if you measure the runtime. > What would be the worst case here, a table with one text column made > of non-NULL still very short values? I think you need a bunch of columns to see it, like in the benchmark I just posted - I didn't test any other number of columns than 10 though. Greetings, Andres Freund
Re: Speed up pg_checksums in cases where checksum already set
In one of the checksum patches, there was an understanding that the pages should be written even if the checksum is correct, to handle replicas. From the v19 patch: https://www.postgresql.org/message-id/F7AFCFCD-8F77-4546-8D42-C7F675A4B680%40yesql.se +* Mark the buffer as dirty and force a full page write. We have to +* re-write the page to WAL even if the checksum hasn't changed, +* because if there is a replica it might have a slightly different +* version of the page with an invalid checksum, caused by unlogged +* changes (e.g. hintbits) on the master happening while checksums +* were off. This can happen if there was a valid checksum on the page +* at one point in the past, so only when checksums are first on, then +* off, and then turned on again. pg_checksums(1) says: | When using a replication setup with tools which perform direct copies of relation file blocks (for example pg_rewind(1)), enabling or disabling checksums can lead to page | corruptions in the shape of incorrect checksums if the operation is not done consistently across all nodes. When enabling or disabling checksums in a replication setup, it | is thus recommended to stop all the clusters before switching them all consistently. Destroying all standbys, performing the operation on the primary and finally recreating | the standbys from scratch is also safe. Does your patch complicate things for the "stop all the clusters before switching them all" case? -- Justin
Re: Remaining references to RecentGlobalXmin
Hi, On 2021-05-24 15:47:48 +0900, Michael Paquier wrote: > dc7420c2 has removed RecentGlobalXmin, but there are still references > to it in the code, and a set of FIXME references, like this one in > autovacuum.c (three in total): > /* > * Start a transaction so we can access pg_database, and get a snapshot. > * We don't have a use for the snapshot itself, but we're interested in > * the secondary effect that it sets RecentGlobalXmin. (This is critical > * for anything that reads heap pages, because HOT may decide to prune > * them even if the process doesn't attempt to modify any tuples.) > * > * FIXME: This comment is inaccurate / the code buggy. A snapshot that is > * not pushed/active does not reliably prevent HOT pruning (->xmin could > * e.g. be cleared when cache invalidations are processed). > */ > > Wouldn't it be better to clean up that? Sure, but the real cleanup necessary isn't to remove the reference to RecentGlobalXmin nor specific to 14. It's that the code isn't right, and hasn't been for a long time. https://www.postgresql.org/message-id/20200407072418.ccvnyjbrktyi3rzc%40alap3.anarazel.de Greetings, Andres Freund
Re: [HACKERS] logical decoding of two-phase transactions
On Thu, May 27, 2021 at 11:20 AM tanghy.f...@fujitsu.com wrote: > > On Wed, May 26, 2021 10:13 PM Ajin Cherian wrote: > > > > I've attached a patch that fixes this issue. Do test and confirm. > > > > Thanks for your patch. > I have tested and confirmed that the issue I reported has been fixed. Thanks for confirmation. The problem seemed to be as you reported a table not closed when a transaction was committed. This seems to be because the function UpdateTwoPhaseState was committing a transaction inside the function when the caller of UpdateTwoPhaseState had a table open in CreateSubscription. This function was newly included in the CreateSubscription code, to handle the new use case of two_phase being enabled on create subscription if "copy_data = false". I don't think CreateSubscription required this to be inside a transaction and the committing of transaction was only meant for where this function was originally created to be used in the apply worker code (ApplyWorkerMain()). So, I removed the committing of the transaction from inside the function UpdateTwoPhaseState() and instead started and committed the transaction prior to and after this function is invoked in the apply worker code. regards, Ajin Cherian Fujitsu Australia
Re: Move pg_attribute.attcompression to earlier in struct for reduced size?
Andres Freund writes: > That's not *too* bad, but also not nothing The memsets seem to be easy to get rid of. memset the array to zeroes *once* before entering the per-tuple loop. Then, in the loop that looks for stuff to pfree, reset any entries that are found to be set, thereby returning the array to all zeroes for the next iteration. I'm having a hard time though believing that the memset is the main problem. I'd think the pfree search loop is at least as expensive. Maybe skip that when not useful, by having a single bool flag remembering whether any columns got detoasted in this row? regards, tom lane
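A sketch of that scheme, with all names invented and the tuple-processing details elided (the real code lives in the table-rewrite path), might look like this:

#include "postgres.h"

static void
rewrite_loop_sketch(int natts)
{
    Datum      *detoasted = (Datum *) palloc0(natts * sizeof(Datum));  /* zeroed once */
    bool        done = false;

    while (!done)               /* per-tuple loop */
    {
        bool        any_detoasted = false;

        /*
         * ... deform the tuple; whenever a column is detoasted, keep the copy:
         * detoasted[i] = PointerGetDatum(copy); any_detoasted = true;
         * set done = true once the scan runs out of tuples ...
         */

        /* ... form and write out the new tuple ... */

        if (any_detoasted)      /* skip the search when nothing was detoasted */
        {
            for (int i = 0; i < natts; i++)
            {
                if (detoasted[i] != (Datum) 0)
                {
                    pfree(DatumGetPointer(detoasted[i]));
                    detoasted[i] = (Datum) 0;   /* back to all zeroes */
                }
            }
        }
    }

    pfree(detoasted);
}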
RE: Skip partition tuple routing with constant partition key
Hi amit-san From: Amit Langote Sent: Wednesday, May 26, 2021 9:38 AM > > Hou-san, > > On Wed, May 26, 2021 at 10:05 AM houzj.f...@fujitsu.com > wrote: > > > > Thanks for the explanation ! > > Yeah, we can get all the parent table info from PartitionTupleRouting when INSERT into a partitioned table. > > > > But I have two issues about using the information from PartitionTupleRouting to get the parent table's key expression: > > 1) It seems we do not initialize the PartitionTupleRouting when directly INSERT into a partition(not a partitioned table). > > I think it will be better we let the pre-compute-key_expression > > feature to be used in all the possible cases, because it could bring nice performance improvement. > > > > 2) When INSERT into a partitioned table which is also a partition, the PartitionTupleRouting is initialized after the ExecPartitionCheck. > > Hmm, do we really need to optimize ExecPartitionCheck() when partitions are directly inserted into? As also came up earlier in the thread, we want to > discourage users from doing that to begin with, so it doesn't make much sense to spend our effort on that case. > > Optimizing ExecPartitionCheck(), specifically your idea of pre-computing the partition key expressions, only came up after finding that the earlier patch to > teach ExecFindPartition() to cache partitions may benefit from it. IOW, optimizing ExecPartitionCheck() for its own sake does not seem worthwhile, > especially not if we'd need to break module boundaries to make that happen. > > Thoughts? OK, I see the point, thanks for the explanation. Let's try to move forward. About teaching relcache to cache the target partition: David-san suggested caching the partidx in PartitionDesc, which would need looping and checking the cached value at each level. I was thinking: could we instead cache a partidx list [1, 2, 3], follow that list to reach the last partition, and run the partition CHECK only on that last partition? If anything unexpected happens, we can return to the original table and redo the tuple routing without using the cached index. What do you think? Best regards, houzj
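A very rough sketch of that idea follows; the caching layer itself is not shown, the helper and parameter names are invented, and the cached leaf would in practice be reached by following the remembered partidx list:

#include "postgres.h"

#include "executor/execPartition.h"
#include "executor/executor.h"

/*
 * Invented helper: try the leaf remembered from the previous tuple, validate
 * it with a single partition CHECK, and fall back to full routing on a miss.
 */
static ResultRelInfo *
route_with_cached_leaf(ModifyTableState *mtstate,
                       ResultRelInfo *rootRelInfo,
                       PartitionTupleRouting *proute,
                       TupleTableSlot *slot,
                       EState *estate,
                       ResultRelInfo *cached_leaf)
{
    if (cached_leaf != NULL &&
        ExecPartitionCheck(cached_leaf, slot, estate, false))
        return cached_leaf;     /* the cached leaf still accepts this tuple */

    /* Cache miss: redo normal tuple routing (which would refill the cache). */
    return ExecFindPartition(mtstate, rootRelInfo, proute, slot, estate);
}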
Re: Decoding speculative insert with toast leaks memory
On Wed, Mar 24, 2021 at 10:34 PM Ashutosh Bapat wrote: > Hi All, > We saw OOM in a system where WAL sender consumed Gigabttes of memory > which was never released. Upon investigation, we found out that there > were many ReorderBufferToastHash memory contexts linked to > ReorderBuffer context, together consuming gigs of memory. They were > running INSERT ... ON CONFLICT .. among other things. A similar report > at [1] What is the relationship between this bug and commit 7259736a6e5, which dealt specifically with TOAST and speculative insertion resource management issues within reorderbuffer.c? Amit? -- Peter Geoghegan
RE: Fdw batch insert error out when set batch_size > 65535
From: Bharath Rupireddy Sent: Wednesday, May 26, 2021 9:56 PM > On Wed, May 26, 2021 at 6:36 PM Tomas Vondra > wrote: > > > > On 5/26/21 8:57 AM, Bharath Rupireddy wrote: > > > On Tue, May 25, 2021 at 2:47 PM Bharath Rupireddy > > > wrote: > > >> > > >> On Tue, May 25, 2021 at 1:08 PM houzj.f...@fujitsu.com > > >> wrote: > > >>> Thanks for the comments. I have addressed all comments to the v3 patch. > > >> > > >> Thanks! The patch basically looks good to me except that it is > > >> missing a commit message. I think it can be added now. > > > > > > With v3 patch, I observed failure in postgres_fdw test cases with > > > insert query in prepared statements. Root cause is that in > > > postgresGetForeignModifyBatchSize, fmstate can be null (see the > > > existing code which handles for fmstate null cases). I fixed this, > > > and added a commit message. PSA v4 patch. > > > > > > > Thanks. In what situation is the fmstate NULL? If it is NULL, the > > current code simply skips the line adjusting it. Doesn't that mean we > > may not actually fix the bug in that case? > > fmstate i.e. resultRelInfo->ri_FdwState is NULL for EXPLAIN without ANALYZE cases, below comment says it and we can't get the bug because we don't actually execute the insert statement. Agreed. The "ri_FdwState" is initialized in postgresBeginForeignInsert or postgresBeginForeignModify, and I think those functions are always invoked before the batch size is fetched. Only in EXPLAIN (without ANALYZE) mode is ri_FdwState left uninitialized: /* * Do nothing in EXPLAIN (no ANALYZE) case. resultRelInfo->ri_FdwState * stays NULL. */ if (eflags & EXEC_FLAG_EXPLAIN_ONLY) return; Best regards, houzj
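For reference, a hedged sketch of the kind of clamping this thread is about (not the committed code; the struct and constant names here are stand-ins), given that the extended-query protocol allows at most 65535 bind parameters per Bind message:

#include "postgres.h"

#define MAX_BIND_PARAMS     65535   /* protocol limit on parameters per Bind */

/* Stand-in for the relevant part of postgres_fdw's modify state. */
typedef struct FdwModifyStateSketch
{
    int         p_nums;             /* parameters transmitted per row */
} FdwModifyStateSketch;

static int
clamp_batch_size(FdwModifyStateSketch *fmstate, int batch_size)
{
    /* NULL fmstate (EXPLAIN without ANALYZE): nothing is executed, leave it. */
    if (fmstate && fmstate->p_nums > 0 &&
        batch_size * fmstate->p_nums > MAX_BIND_PARAMS)
        batch_size = MAX_BIND_PARAMS / fmstate->p_nums;

    return batch_size;
}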
Re: Move pg_attribute.attcompression to earlier in struct for reduced size?
Hi, On 2021-05-26 22:43:42 -0400, Tom Lane wrote: > The memsets seem to be easy to get rid of. memset the array > to zeroes *once* before entering the per-tuple loop. Then, > in the loop that looks for stuff to pfree, reset any entries > that are found to be set, thereby returning the array to all > zeroes for the next iteration. > I"m having a hard time though believing that the memset is the > main problem. I'd think the pfree search loop is at least as > expensive. Maybe skip that when not useful, by having a single > bool flag remembering whether any columns got detoasted in this > row? Yea, I tested that - it does help in the integer case. But the bigger contributors are the loop over the attributes, and especially the access to the datum's compression method. Particularly the latter seems hard to avoid. Greetings, Andres Freund
Re: Decoding speculative insert with toast leaks memory
On Thu, Mar 25, 2021 at 11:04 AM Ashutosh Bapat wrote: > > Hi All, > We saw OOM in a system where WAL sender consumed Gigabttes of memory > which was never released. Upon investigation, we found out that there > were many ReorderBufferToastHash memory contexts linked to > ReorderBuffer context, together consuming gigs of memory. They were > running INSERT ... ON CONFLICT .. among other things. A similar report > at [1] > .. > > but by then we might have reused the toast_hash and thus can not be > destroyed. But that isn't the problem since the reused toast_hash will > be destroyed eventually. > > It's only when the change next to speculative insert is something > other than INSERT/UPDATE/DELETE that we have to worry about a > speculative insert that was never confirmed. So may be for those > cases, we check whether specinsert != null and destroy toast_hash if > it exists. > Can we consider destroying the toast_hash in ReorderBufferCleanupTXN/ReorderBufferTruncateTXN? That would delay the cleanup of the memory until the end of the stream or transaction, but there would be no memory leak. -- With Regards, Amit Kapila.
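A sketch of how that suggestion might slot into the cleanup paths (illustrative only; ReorderBufferToastReset() is a static helper in reorderbuffer.c, declared here just to keep the fragment self-contained):

#include "postgres.h"

#include "replication/reorderbuffer.h"

/* Declared here only for the sketch; the real function is static in reorderbuffer.c. */
extern void ReorderBufferToastReset(ReorderBuffer *rb, ReorderBufferTXN *txn);

/* Reset any leftover toast hash when a transaction is cleaned up or truncated. */
static void
cleanup_txn_toast(ReorderBuffer *rb, ReorderBufferTXN *txn)
{
    if (txn->toast_hash != NULL)
        ReorderBufferToastReset(rb, txn);
}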
Re: Decoding speculative insert with toast leaks memory
On Thu, May 27, 2021 at 8:27 AM Peter Geoghegan wrote: > > On Wed, Mar 24, 2021 at 10:34 PM Ashutosh Bapat > wrote: > > Hi All, > > We saw OOM in a system where WAL sender consumed Gigabttes of memory > > which was never released. Upon investigation, we found out that there > > were many ReorderBufferToastHash memory contexts linked to > > ReorderBuffer context, together consuming gigs of memory. They were > > running INSERT ... ON CONFLICT .. among other things. A similar report > > at [1] > > What is the relationship between this bug and commit 7259736a6e5, > dealt specifically with TOAST and speculative insertion resource > management issues within reorderbuffer.c? Amit? > This seems to be a pre-existing bug. It should be reproducible on PG-13 and/or on releases prior to that commit. Ashutosh, can you confirm? -- With Regards, Amit Kapila.
Re: Move pg_attribute.attcompression to earlier in struct for reduced size?
Andres Freund writes: > Yea, I tested that - it does help in the integer case. But the bigger > contributors are the loop over the attributes, and especially the access > to the datum's compression method. Particularly the latter seems hard to > avoid. So maybe we should just dump the promise that VACUUM FULL will recompress everything? I'd be in favor of that actually, because it seems 100% outside the charter of either VACUUM FULL or CLUSTER. regards, tom lane
Re: Parallel Inserts in CREATE TABLE AS
On Thu, May 27, 2021 at 7:12 AM houzj.f...@fujitsu.com wrote: > I followed your above test steps and the below configuration, but my test > results are a little different from yours. > I am not sure the exact reason, maybe because of the hardware.. > > Test INSERT 1000 rows((2 bigint(of 8 bytes) 16 name(of 64 bytes each) > columns): > SERIAL: 22023.631 ms > PARALLEL 2 WORKER [NOT SKIP FSM]: 21824.934 ms [SKIP FSM]: 19381.474 ms > PARALLEL 4 WORKER [NOT SKIP FSM]: 20481.117 ms [SKIP FSM]: 18381.305 ms I'm not sure why there's a huge difference in the execution time, on your system it just takes ~20sec whereas on my system(with SSD) it takes ~115 sec. I hope you didn't try creating the unlogged table in CTAS right? Just for reference, the exact use case I tried is at [1]. The configure command I used to build the postgres source code is at [2]. I don't know whether I'm missing something here. [1] case 4 - 2 bigint(of 8 bytes each) columns, 16 name(of 64 bytes each) columns, tuple size 1064 bytes, 10mn tuples DROP TABLE tenk1; CREATE UNLOGGED TABLE tenk1(c1 bigint, c2 bigint, c3 name, c4 name, c5 name, c6 name, c7 name, c8 name, c9 name, c10 name, c11 name, c12 name, c13 name, c14 name, c15 name, c16 name, c17 name, c18 name); INSERT INTO tenk1 values(generate_series(1,1000), generate_series(1,1000), upper(substring(md5(random()::varchar),2,8)), upper(substring(md5(random()::varchar),2,8)), upper(substring(md5(random()::varchar),2,8)), upper(substring(md5(random()::varchar),2,8)), upper(substring(md5(random()::varchar),2,8)), upper(substring(md5(random()::varchar),2,8)), upper(substring(md5(random()::varchar),2,8)), upper(substring(md5(random()::varchar),2,8)), upper(substring(md5(random()::varchar),2,8)), upper(substring(md5(random()::varchar),2,8)), upper(substring(md5(random()::varchar),2,8)), upper(substring(md5(random()::varchar),2,8)), upper(substring(md5(random()::varchar),2,8)), upper(substring(md5(random()::varchar),2,8)), upper(substring(md5(random()::varchar),2,8)), upper(substring(md5(random()::varchar),2,8))); explain analyze verbose create table test as select * from tenk1; [2] ./configure --with-zlib --prefix=$PWD/inst/ --with-openssl --with-readline --with-libxml > war.log && make -j 8 install > war.log 2>&1 & With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com