Possible optimization of heap_lock_tuple()

2021-05-26 Thread Antonin Houska
It seems that a concurrent UPDATE can restart heap_lock_tuple() even if it's
not necessary. Is the attached proposal correct and worth applying?

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 6ac07f2fda..c3310ab4f9 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -4553,7 +4553,7 @@ l3:
  */
 if (!HeapTupleHeaderIsOnlyLocked(tuple->t_data) &&
 	((tuple->t_data->t_infomask2 & HEAP_KEYS_UPDATED) ||
-	 !updated))
+	 (!updated && follow_updates)))
 	goto l3;
 
 /* Things look okay, so we can skip sleeping */


Re: Batch insert in CTAS/MatView code

2021-05-26 Thread Bharath Rupireddy
On Wed, May 26, 2021 at 12:18 PM tsunakawa.ta...@fujitsu.com
 wrote:
>
> Hello Paul-san,
>
> From: Daniel Gustafsson 
> > In an off-list discussion with Paul, we decided to withdraw this patch for 
> > now
> > and instead create a new entry when there is a re-worked patch.  This has
> > now
> > been done in the CF app.
>
> Would you mind if I take over this patch for PG 15?  I find this promising, 
> as Bharath-san demonstrated good performance by combining your patch and the 
> parallel CTAS that Bharath-san has been working on.  We'd like to do things 
> that enhance parallelism.
>
> Please allow me to start posting the revised patches next week if I can't get 
> your reply.

Hi,

I think the "New Table Access Methods for Multi and Single Inserts"
patches at [1] make multi insert usage easy for COPY, CTAS/Create Mat
View, Refresh Mat View and so on. It also has a patch for multi
inserts in CTAS and Refresh Mat View
(v6-0002-CTAS-and-REFRESH-Mat-View-With-New-Multi-Insert-T.patch).
Please see that thread and feel free to review it.

[1] - 
https://www.postgresql.org/message-id/CALj2ACXdrOmB6Na9amHWZHKvRT3Z0nwTRsCwoMT-npOBtmXLXg%40mail.gmail.com

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




RE: Batch insert in CTAS/MatView code

2021-05-26 Thread tsunakawa.ta...@fujitsu.com
From: Bharath Rupireddy 
> I think the "New Table Access Methods for Multi and Single Inserts"
> patches at [1] make multi insert usage easy for COPY, CTAS/Create Mat
> View, Refresh Mat View and so on. It also has a patch for multi
> inserts in CTAS and Refresh Mat View
> (v6-0002-CTAS-and-REFRESH-Mat-View-With-New-Multi-Insert-T.patch).
> Please see that thread and feel free to review it.

Ouch, I didn't notice that the patch was reborn in that thread.  OK.

Could you add it to the CF if you haven't yet?


Regards
Takayuki Tsunakawa



Re: Batch insert in CTAS/MatView code

2021-05-26 Thread Bharath Rupireddy
On Wed, May 26, 2021 at 12:49 PM tsunakawa.ta...@fujitsu.com
 wrote:
>
> From: Bharath Rupireddy 
> > I think the "New Table Access Methods for Multi and Single Inserts"
> > patches at [1] make multi insert usage easy for COPY, CTAS/Create Mat
> > View, Refresh Mat View and so on. It also has a patch for multi
> > inserts in CTAS and Refresh Mat View
> > (v6-0002-CTAS-and-REFRESH-Mat-View-With-New-Multi-Insert-T.patch).
> > Please see that thread and feel free to review it.
>
> Ouch, I didn't notice that the patch was reborn in that thread.  OK.
>
> Could you add it to the CF if you haven't yet?

It already has one - https://commitfest.postgresql.org/33/2871/

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Fix typo: multiple tuple => tuples

2021-05-26 Thread houzj.f...@fujitsu.com
Hi,

I found a possible typo in the code comments of heap_multi_insert.

- * heap_multi_insert   - insert multiple tuple into a heap
+ * heap_multi_insert   - insert multiple tuples into a heap

Attaching a patch to fix it.

Best regards,
houzj


0001-fix-typo.patch
Description: 0001-fix-typo.patch


Re: Incorrect snapshots while promoting hot standby node when 2PC is used

2021-05-26 Thread Michael Paquier
On Thu, Apr 22, 2021 at 01:36:03PM -0700, Andres Freund wrote:
> The sequence in StartupXLOG() leading to the issue is the following:
> 
> 1) RecoverPreparedTransactions();
> 2) ShutdownRecoveryTransactionEnvironment();
> 3) XLogCtl->SharedRecoveryState = RECOVERY_STATE_DONE;
> 
> Because 2) resets the KnownAssignedXids machinery, snapshots that happen
> between 2) and 3) will not actually look at the procarray to compute
> snapshots, as that only happens within
> 
>   snapshot->takenDuringRecovery = RecoveryInProgress();
>   if (!snapshot->takenDuringRecovery)
> 
> and RecoveryInProgress() checks XLogCtl->SharedRecoveryState !=
> RECOVERY_STATE_DONE, which is set in 3).

Oh, indeed.  It is easy to see RecentXmin jumping back-and-forth while
running 023_pitr_prepared_xact.pl with a small sleep added just after
ShutdownRecoveryTransactionEnvironment().

> Without prepared transaction there probably isn't an issue, because there
> shouldn't be any other in-progress xids at that point?

Yes, there should not be any as far as I recall.  2PC is kind of
special with its fake ProcArray entries.

> I think to fix the issue we'd have to move
> ShutdownRecoveryTransactionEnvironment() to after XLogCtl->SharedRecoveryState
> = RECOVERY_STATE_DONE.
> 
> The acquisition of ProcArrayLock() in
> ShutdownRecoveryTransactionEnvironment()->ExpireAllKnownAssignedTransactionIds()
> should prevent the data from being removed between the RecoveryInProgress()
> and the KnownAssignedXidsGetAndSetXmin() calls in GetSnapshotData().
> 
> I haven't yet figured out whether there would be a problem with deferring the
> other tasks in ShutdownRecoveryTransactionEnvironment() until after
> RECOVERY_STATE_DONE.

Hmm.  This would mean releasing all the exclusive locks tracked by a
standby, as of StandbyReleaseAllLocks(), after opening the instance
for writes after a promotion.  I don't think that's unsafe, but it
would be intrusive.

Anyway, isn't the issue ExpireAllKnownAssignedTransactionIds() itself,
where we should try to not wipe out the 2PC entries to make sure that
all those snapshots still see the 2PC transactions as something to
count on?  I am attaching a crude patch to show the idea.

Thoughts?
--
Michael
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 42a89fc5dc..2e819163c4 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -4853,10 +4853,24 @@ KnownAssignedXidsRemovePreceding(TransactionId removeXid)
 
 	if (!TransactionIdIsValid(removeXid))
 	{
-		elog(trace_recovery(DEBUG4), "removing all KnownAssignedXids");
-		pArray->numKnownAssignedXids = 0;
-		pArray->headKnownAssignedXids = pArray->tailKnownAssignedXids = 0;
-		return;
+		tail = pArray->tailKnownAssignedXids;
+		head = pArray->headKnownAssignedXids;
+
+		for (i = tail; i < head; i++)
+		{
+			if (KnownAssignedXidsValid[i])
+			{
+TransactionId knownXid = KnownAssignedXids[i];
+
+if (!StandbyTransactionIdIsPrepared(knownXid))
+{
+	KnownAssignedXidsValid[i] = false;
+	count++;
+}
+			}
+		}
+
+		goto recompute;
 	}
 
 	elog(trace_recovery(DEBUG4), "prune KnownAssignedXids to %u", removeXid);
@@ -4885,6 +4899,9 @@ KnownAssignedXidsRemovePreceding(TransactionId removeXid)
 		}
 	}
 
+	/* fall through */
+recompute:
+
 	pArray->numKnownAssignedXids -= count;
 	Assert(pArray->numKnownAssignedXids >= 0);
 


signature.asc
Description: PGP signature


Re: [BUG] Autovacuum not dynamically decreasing cost_limit and cost_delay

2021-05-26 Thread Masahiko Sawada
On Wed, Apr 14, 2021 at 11:17 PM Mead, Scott  wrote:
>
>
>
> > On Mar 1, 2021, at 8:43 PM, Masahiko Sawada  wrote:
> >
> > On Mon, Feb 8, 2021 at 11:49 PM Mead, Scott  wrote:
> >>
> >> Hello,
> >>   I recently looked at what it would take to make a running autovacuum 
> >> pick-up a change to either cost_delay or cost_limit.  Users frequently 
> >> will have a conservative value set, and then wish to change it when 
> >> autovacuum initiates a freeze on a relation.  Most users end up finding 
> >> out they are in ‘to prevent wraparound’ after it has happened; this means 
> >> that if they want the vacuum to take advantage of more I/O, they need to 
> >> stop and then restart the currently running vacuum (after reloading the 
> >> GUCs).
> >>
> >>  Initially, my goal was to determine feasibility for making this dynamic.  
> >> I added debug code to vacuum.c:vacuum_delay_point(void) and found that 
> >> changes to cost_delay and cost_limit are already processed by a running 
> >> vacuum.  There was a bug preventing the cost_delay or cost_limit from 
> >> being configured to allow higher throughput however.
> >>
> >> I believe this is a bug because currently, autovacuum will dynamically 
> >> detect and increase the cost_limit or cost_delay, but it can never 
> >> decrease those values beyond their setting when the vacuum began.  The 
> >> current behavior is for vacuum to limit the maximum throughput of 
> >> currently running vacuum processes to the cost_limit that was set when the 
> >> vacuum process began.
> >
> > Thanks for your report.
> >
> > I've not looked at the patch yet but I agree that the calculation for
> > autovacuum cost delay seems not to work fine if vacuum-delay-related
> > parameters (e.g., autovacuum_vacuum_cost_delay etc) are changed during
> > vacuuming a table to speed up running autovacuums. Here is my
> > analysis:
>
>
> I appreciate your in-depth analysis and will comment in-line.  That said, I 
> still think it’s important that the attached patch is applied.  As it is 
> today, a simple few lines of code prevent users from being able to increase 
> the throughput on vacuums that are running without having to cancel them 
> first.
>
> The patch that I’ve provided allows users to decrease their vacuum_cost_delay 
> and get an immediate boost in performance to their running vacuum jobs.
>
>
> >
> > Suppose we have the following parameters and 3 autovacuum workers are
> > running on different tables:
> >
> > autovacuum_vacuum_cost_delay = 100
> > autovacuum_vacuum_cost_limit = 100
> >
> > Vacuum cost-based delay parameters for each worker are as follows:
> >
> > worker->wi_cost_limit_base = 100
> > worker->wi_cost_limit = 66
> > worker->wi_cost_delay = 100

Sorry, worker->wi_cost_limit should be 33.

> >
> > Each running autovacuum has "wi_cost_limit = 66" because the total
> > limit (100) is equally rationed. And another point is that the total
> > wi_cost_limit (198 = 66*3) is less than autovacuum_vacuum_cost_limit,
> > 100. Which are fine.

So the total wi_cost_limit, 99, is less than autovacuum_vacuum_cost_limit, 100.

> >
> > Here let's change autovacuum_vacuum_cost_delay/limit value to speed up
> > running autovacuums.
> >
> > Case 1 : increasing autovacuum_vacuum_cost_limit to 1000.
> >
> > After reloading the configuration file, vacuum cost-based delay
> > parameters for each worker become as follows:
> >
> > worker->wi_cost_limit_base = 100
> > worker->wi_cost_limit = 100
> > worker->wi_cost_delay = 100
> >
> > If we rationed autovacuum_vacuum_cost_limit, 1000, to 3 workers, it
> > would be 333. But since we cap it by wi_cost_limit_base, the
> > wi_cost_limit is 100. I think this is what Mead reported here.
>
>
> Yes, this is exactly correct.  The cost_limit is capped at the cost_limit 
> that was set during the start of a running vacuum.  My patch changes this cap 
> to be the max allowed cost_limit (10,000).

The comment for the worker's limit calculation says:

/*
 * We put a lower bound of 1 on the cost_limit, to avoid division-
 * by-zero in the vacuum code.  Also, in case of roundoff trouble
 * in these calculations, let's be sure we don't ever set
 * cost_limit to more than the base value.
 */
worker->wi_cost_limit = Max(Min(limit,
                                worker->wi_cost_limit_base),
                            1);

If we use the max cost_limit as the upper bound here, the worker's
limit could unnecessarily be higher than the base value in case of
roundoff trouble? I think that the problem here is rather that we
don't update wi_cost_limit_base and wi_cost_delay when rebalancing the
cost.
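
For reference, the runtime change in case 1 above boils down to something like
this (a sketch only; the values are the ones quoted upthread):

ALTER SYSTEM SET autovacuum_vacuum_cost_limit = 1000;
SELECT pg_reload_conf();
-- Expectation: each of the three running workers' share rises towards 333.
-- Observed: the share stays capped at wi_cost_limit_base (100) for the
-- already-running workers.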

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: Assertion failure while streaming toasted data

2021-05-26 Thread Dilip Kumar
On Wed, May 26, 2021 at 11:53 AM Dilip Kumar  wrote:
>
> On Wed, May 26, 2021 at 11:19 AM Amit Kapila  wrote:
> >

>
> > I searched and didn't find any similar existing tests. Can we think of
> > any other way to test this code path? We already have one copy test in
> > toast.sql, isn't it possible to write a similar test here?
>
> I will check that and let you know.

I have followed this approach to write the test.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From af116766c8562c36eb00c836505fe4725d19f81d Mon Sep 17 00:00:00 2001
From: Dilip Kumar 
Date: Tue, 25 May 2021 14:51:45 +0530
Subject: [PATCH v3] Fix bug while streaming the multi-insert toast changes

While processing a multi-insert we cannot clean up the toast until
we get the last insert of the multi-insert.  So mark the transaction
incomplete for streaming if we get any multi-insert change, and keep
it set until we get the last tuple of the multi-insert.
---
 contrib/test_decoding/expected/stream.out   | 24 +
 contrib/test_decoding/sql/stream.sql| 18 +
 src/backend/replication/logical/reorderbuffer.c | 36 +++--
 src/include/replication/reorderbuffer.h | 27 +--
 4 files changed, 64 insertions(+), 41 deletions(-)

diff --git a/contrib/test_decoding/expected/stream.out b/contrib/test_decoding/expected/stream.out
index e1c3bc8..0f21dcb 100644
--- a/contrib/test_decoding/expected/stream.out
+++ b/contrib/test_decoding/expected/stream.out
@@ -82,6 +82,30 @@ SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'incl
  committing streamed transaction
 (13 rows)
 
+-- streaming test for toast with multi-insert
+\COPY stream_test FROM STDIN
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+   data   
+--
+ opening a streamed block for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ closing a streamed block for transaction
+ opening a streamed block for transaction
+ streaming change for transaction
+ closing a streamed block for transaction
+ committing streamed transaction
+(17 rows)
+
 DROP TABLE stream_test;
 SELECT pg_drop_replication_slot('regression_slot');
  pg_drop_replication_slot 
diff --git a/contrib/test_decoding/sql/stream.sql b/contrib/test_decoding/sql/stream.sql
index ce86c81..4feec62 100644
--- a/contrib/test_decoding/sql/stream.sql
+++ b/contrib/test_decoding/sql/stream.sql
@@ -26,5 +26,23 @@ SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'inc
 INSERT INTO stream_test SELECT repeat('a', 6000) || g.i FROM generate_series(1, 10) g(i);
 SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
 
+-- streaming test for toast with multi-insert
+\COPY stream_test FROM STDIN
+toasted-1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456

Re: locking [user] catalog tables vs 2pc vs logical rep

2021-05-26 Thread Petr Jelinek
On 26 May 2021, at 05:04, Amit Kapila  wrote:
> 
> On Tue, May 25, 2021 at 12:40 PM Michael Paquier  wrote:
>> 
>> On Mon, May 24, 2021 at 10:03:01AM +0530, Amit Kapila wrote:
>>> So, this appears to be an existing caveat of synchronous replication.
>>> If that is the case, I am not sure if it is a good idea to just block
>>> such ops for the prepared transaction. Also, what about other
>>> operations which acquire an exclusive lock on [user]_catalog_tables
>>> like:
>>> cluster pg_trigger using pg_trigger_oid_index, similarly cluster on any
>>> user_catalog_table, then the other problematic operation could
>>> truncate of user_catalog_table as is discussed in another thread [1].
>>> I think all such operations can block even with synchronous
>>> replication. I am not sure if we can create examples for all cases
>>> because for ex. we don't have use of user_catalog_tables in-core but
>>> maybe for others, we can try to create examples and see what happens?
>>> 
>>> If all such operations can block for synchronous replication and
>>> prepared transactions replication then we might want to document them
>>> as caveats at page:
>>> https://www.postgresql.org/docs/devel/logicaldecoding-synchronous.html
>>> and then also give the reference for these caveats at prepared
>>> transactions 
>>> page:https://www.postgresql.org/docs/devel/logicaldecoding-two-phase-commits.html
>>> 
>>> What do you think?
>> 
>> It seems to me that the 2PC issues on catalog tables and the issues
>> related to logical replication in synchronous mode are two distinct
>> things that need to be fixed separately.
>> 
> 
> Fair enough. But the way we were looking at them is that they will also
> block (lead to deadlock) for logical replication of prepared
> transactions and also for logical replication in synchronous mode without
> prepared transactions. Now, if we want to deal with the 2PC issues
> separately that should be fine as well. However, for that we need to
> see which all operations we want to block on [user]_catalog_tables.
> The first one is lock command, then there are other operations like
> Cluster which take exclusive lock on system catalog tables and we
> allow them to be part of prepared transactions (example Cluster
> pg_trigger using pg_trigger_oid_index;), another kind of operation is
> Truncate on user_catalog_tables. Now, some of these might not allow
> connecting after restart so we might need to think whether we want to
> prohibit all such operations or only some of them.
> 


IIRC this was discussed the first time 2PC decoding was proposed and everybody 
seemed fine with the limitation, so I'd vote for just documenting it, the same 
way as the sync rep issue.

If you'd prefer fixing it by blocking something, wouldn't it make more sense to 
simply not allow PREPARE if one of these operations was executed, similarly to 
what we do with temp table access?
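
To make the scenario concrete, here is a minimal sketch of the kind of
prepared transaction being discussed (it assumes max_prepared_transactions > 0
and a superuser session; the GID is illustrative):

BEGIN;
LOCK TABLE pg_class IN ACCESS EXCLUSIVE MODE;  -- exclusive lock on a catalog
PREPARE TRANSACTION 'catalog_lock_2pc';
-- While this stays prepared, anything that needs to read pg_class (including
-- a walsender decoding the prepare in a two-phase or synchronous setup) can
-- block behind it; that is the deadlock scenario discussed upthread.
ROLLBACK PREPARED 'catalog_lock_2pc';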

--
Petr



Re: [HACKERS] logical decoding of two-phase transactions

2021-05-26 Thread vignesh C
On Tue, May 25, 2021 at 8:54 AM Ajin Cherian  wrote:
>
> On Fri, May 21, 2021 at 6:43 PM Peter Smith  wrote:
>
> > Fixed in v77-0001, v77-0002
>
> Attaching a new patch-set that rebases the patch, addresses review
> comments from Peter as well as a test failure reported by Tang. I've
> also added some new test case into patch-2 authored by Tang.

Thanks for the updated patch, few comments:
1)  Should "The end LSN of the prepare." be changed to "end LSN of the
prepare transaction."?

--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -7538,6 +7538,13 @@ are available since protocol version 3.
 
 Int64
 
+The end LSN of the prepare.
+
+
+
+
+Int64
+

2) Should the ";" be "," here?
+++ b/doc/src/sgml/catalogs.sgml
@@ -7639,6 +7639,18 @@ SCRAM-SHA-256$:&l

  
   
+   subtwophasestate char
+  
+  
+   State code:
+   d = two_phase mode was not requested, so is disabled;
+   p = two_phase mode was requested, but is
pending enablement;
+   e = two_phase mode was requested, and is enabled.
+  

3) Should end_lsn be commit_end_lsn?
+   prepare_data->commit_end_lsn = pq_getmsgint64(in);
+   if (prepare_data->commit_end_lsn == InvalidXLogRecPtr)
elog(ERROR, "end_lsn is not set in commit prepared message");
+   prepare_data->prepare_time = pq_getmsgint64(in);

4) This change is not required

diff --git a/src/include/replication/pgoutput.h
b/src/include/replication/pgoutput.h
index 0dc460f..93c6731 100644
--- a/src/include/replication/pgoutput.h
+++ b/src/include/replication/pgoutput.h
@@ -29,5 +29,4 @@ typedef struct PGOutputData
boolmessages;
booltwo_phase;
 } PGOutputData;
-
 #endif /* PGOUTPUT_H */

5) Will the worker receive commit prepared / rollback prepared at all, given
that we have skip logic for them in
pgoutput_rollback_prepared_txn and pgoutput_commit_prepared_txn:

+* It is possible that we haven't received the prepare because
+* the transaction did not have any changes relevant to this
+* subscription and was essentially an empty prepare. In which case,
+* the walsender is optimized to drop the empty transaction and the
+* accompanying prepare. Silently ignore if we don't find the prepared
+* transaction.
 */
-   replorigin_session_origin_lsn = prepare_data.end_lsn;
-   replorigin_session_origin_timestamp = prepare_data.commit_time;
+   if (LookupGXact(gid, prepare_data.prepare_end_lsn,
+   prepare_data.prepare_time))
+   {

6) I'm not sure if we could add some tests for skipping empty prepare
transactions; if possible, please add a few tests (a sketch of the scenario
follows after point 7).

7) We could add some debug level log messages for the transaction that
will be skipped.
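
A minimal sketch of the empty-prepare scenario from point 6, run on the
publisher (this assumes max_prepared_transactions > 0; the table and GID are
illustrative, and the table is deliberately not part of the publication):

CREATE TABLE not_published (a int);
BEGIN;
INSERT INTO not_published VALUES (1);  -- nothing published is touched
PREPARE TRANSACTION 'empty_prepare';
COMMIT PREPARED 'empty_prepare';
-- Per the comment quoted in point 5, the walsender drops the empty prepare,
-- and the apply worker silently ignores the later COMMIT PREPARED for a GID
-- it never saw.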

Regards,
Vignesh




Re: locking [user] catalog tables vs 2pc vs logical rep

2021-05-26 Thread vignesh C
On Tue, May 25, 2021 at 12:40 PM Michael Paquier  wrote:
>
> On Mon, May 24, 2021 at 10:03:01AM +0530, Amit Kapila wrote:
> > So, this appears to be an existing caveat of synchronous replication.
> > If that is the case, I am not sure if it is a good idea to just block
> > such ops for the prepared transaction. Also, what about other
> > operations which acquire an exclusive lock on [user]_catalog_tables
> > like:
> > cluster pg_trigger using pg_trigger_oid_index, similarly cluster on any
> > user_catalog_table, then the other problematic operation could
> > truncate of user_catalog_table as is discussed in another thread [1].
> > I think all such operations can block even with synchronous
> > replication. I am not sure if we can create examples for all cases
> > because for ex. we don't have use of user_catalog_tables in-core but
> > maybe for others, we can try to create examples and see what happens?
> >
> > If all such operations can block for synchronous replication and
> > prepared transactions replication then we might want to document them
> > as caveats at page:
> > https://www.postgresql.org/docs/devel/logicaldecoding-synchronous.html
> > and then also give the reference for these caveats at prepared
> > transactions 
> > page:https://www.postgresql.org/docs/devel/logicaldecoding-two-phase-commits.html
> >
> > What do you think?
>
> It seems to me that the 2PC issues on catalog tables and the issues
> > related to logical replication in synchronous mode are two distinct
> things that need to be fixed separately.
>
> The issue with LOCK taken on a catalog while a PREPARE TRANSACTION
> holds locks around is bad enough in itself as it could lock down a
> user from a cluster as long as the PREPARE TRANSACTION is not removed
> from WAL (say the relation is critical for the connection startup).
> This could be really disruptive for the user even if he tried to take
> a lock on an object he owns, and the way to recover is not easy here:
> it involves either an old backup or, worse, pg_resetwal.
>
> The second issue with logical replication is still disruptive, but it
> looks to me more like a don't-do-it issue, and documenting the caveats
> sounds fine enough.
>
> Looking at the patch from upthread..
>
> +   /*
> +* Make note that we've locked a system table or an user catalog
> +* table. This flag will be checked later during prepare 
> transaction
> +* to fail the prepare transaction.
> +*/
> +   if (lockstmt->mode >= ExclusiveLock &&
> +   (IsCatalogRelationOid(reloid) ||
> +RelationIsUsedAsCatalogTable(rel)))
> +   MyXactFlags |= XACT_FLAGS_ACQUIREDEXCLUSIVELOCK_SYSREL;
> I think that I'd just use IsCatalogRelationOid() here, and I'd be more
> severe and restrict all attempts for any lock levels.  It seems to me
> that this needs to happen within RangeVarCallbackForLockTable().
> I would also rename the flag as just XACT_FLAGS_LOCKEDCATALOG.
>
> +errmsg("cannot PREPARE a transaction that has an exclusive 
> lock on user catalog/system table(s)")));
> What about "cannot PREPARE a transaction that has locked a catalog
> relation"?

At this point it is not clear if we are planning to fix this issue by
throwing an error or by documenting it. I will fix these comments once we
come to a consensus.

Regards,
Vignesh




Re: Skipping logical replication transactions on subscriber side

2021-05-26 Thread Amit Kapila
On Tue, May 25, 2021 at 6:12 PM Masahiko Sawada  wrote:
>
> On Tue, May 25, 2021 at 7:21 PM Bharath Rupireddy
>  wrote:
> >
> > If there's no way to get the "correct LSN", then why can't we just
> > print that LSN in the error context and/or in the new statistics view
> > for logical replication workers, so that any of the existing ways can
> > be used to skip exactly one txn?
>
> I think specifying XID to the subscription is more understandable for users.
>

I agree with you that specifying an XID could be easier and more
understandable for users. I was thinking and studying a bit about what
other systems do in this regard. Why don't we try to provide conflict
resolution methods for users? The idea could be that the conflicts can
be resolved either automatically or manually. In the case of manual
resolution, users can use the existing methods or the XID stuff you are
proposing here, and in the case of automatic resolution, built-in or
corresponding user-defined functions would be invoked for conflict
resolution. There are more details to figure out in the automatic
resolution scheme, but I see a lot of value in doing this.

-- 
With Regards,
Amit Kapila.




Re: locking [user] catalog tables vs 2pc vs logical rep

2021-05-26 Thread Amit Kapila
On Wed, May 26, 2021 at 1:53 PM Petr Jelinek
 wrote:
>
> On 26 May 2021, at 05:04, Amit Kapila  wrote:
> >
> > On Tue, May 25, 2021 at 12:40 PM Michael Paquier  
> > wrote:
> >>
> >> It seems to me that the 2PC issues on catalog tables and the issues
> >> related to logical replication in synchronous mode are two distinct
> >> things that need to be fixed separately.
> >>
> >
> > Fair enough. But the way we were looking at them is that they will also
> > block (lead to deadlock) for logical replication of prepared
> > transactions and also for logical replication in synchronous mode without
> > prepared transactions. Now, if we want to deal with the 2PC issues
> > separately that should be fine as well. However, for that we need to
> > see which all operations we want to block on [user]_catalog_tables.
> > The first one is lock command, then there are other operations like
> > Cluster which take exclusive lock on system catalog tables and we
> > allow them to be part of prepared transactions (example Cluster
> > pg_trigger using pg_trigger_oid_index;), another kind of operation is
> > Truncate on user_catalog_tables. Now, some of these might not allow
> > connecting after restart so we might need to think whether we want to
> > prohibit all such operations or only some of them.
> >
>
>
> IIRC this was discussed the first time 2PC decoding was proposed and 
> everybody seemed fine with the limitation so I'd vote for just documenting 
> it, same way as the sync rep issue.
>

+1.

> If you'd prefer fixing it by blocking something, wouldn't it make more sense 
> to simply not allow PREPARE if one of these operations was executed, 
> similarly to what we do with temp table access?
>

The point was that even if we somehow block this at prepare, there doesn't
seem to be a simple fix for synchronous logical replication, which can
also have similar problems. So, I would prefer to document it, and we
can even think about backpatching the sync-rep-related documentation.
Michael seems to be a bit interested in dealing with some of the 2PC
issues for reasons other than logical replication, which I am not
completely sure is a good idea, and Tom also feels it is not a good
idea.

-- 
With Regards,
Amit Kapila.




Re: Add ZSON extension to /contrib/

2021-05-26 Thread Aleksander Alekseev
Hi hackers,

Many thanks for your feedback, I very much appreciate it!

> If the extension is mature enough, why make it an extension in
> contrib, and not instead either enhance the existing jsonb type with
> it or make it a built-in type?

> IMO we have too d*mn many JSON types already.  If we can find a way
> to shoehorn this optimization into JSONB, that'd be great.  Otherwise
> I do not think it's worth the added user confusion.

Magnus, Tom,

My reasoning is that if the problem can be solved with an extension
there is little reason to modify the core. This seems to be in the
spirit of PostgreSQL. If the community reaches the consensus to modify
the core to introduce a similar feature, we could discuss this as
well. It sounds like a lot of unnecessary work to me though (see
below).

> * doesn't cover all cases, notably indexes.

Tom,

Not sure if I follow. What cases do you have in mind?

> Do note that e.g. postgis is not in contrib, but is available in e.g. RDS.

Matthias,

Good point. I suspect that PostGIS is an exception though...

> I like the idea of the ZSON type, but I'm somewhat disappointed by its
> current limitations

Several people suggested various enhancements right after learning
about ZSON. Time showed, however, that none of the real-world users
really need e.g. more than one common dictionary per database. I
suspect this is because no one has more than 2**16 repeatable unique
strings (one dictionary limitation) in their documents. Thus there is
no benefit in having separate dictionaries and corresponding extra
complexity.

> - Each dictionary uses a lot of memory, regardless of the number of
> actual stored keys. For 32-bit systems the base usage of a dictionary
> without entries ((sizeof(Word) + sizeof(uint16)) * 2**16) would be
> almost 1MB, and for 64-bit it would be 1.7MB. That is significantly
> more than I'd want to install.

You are probably right on this one, this part could be optimized. I
will address this if we agree on submitting the patch.

> - You call gettimeofday() in both dict_get and in get_current_dict_id.
> These functions can be called in short and tight loops (for small ZSON
> fields), in which case it would add significant overhead through the
> implied syscalls.

I must admit, I'm not an expert in this area. My understanding is that
gettimeofday() is implemented as a single virtual memory access on
modern operating systems, e.g. via the vDSO on Linux, so it's very cheap.
I'm not so sure about other supported platforms though. Probably
worth investigating.

> It does mean that you're deTOASTing
> the full ZSON field, and that the stored bytestring will not be
> structured / doesn't work well with current debuggers.

Unfortunately, I'm not very well aware of debugging tools in this
context. Could you please name the debuggers I should take into
account?

> We (2ndQuadrant, now part of EDB) made some enhancements to Zson a few years 
> ago, and I have permission to contribute those if this proposal is adopted.

Andrew,

That's great, and personally I very much like the enhancements you've
made. Purely out of curiosity, did they end up as part of
2ndQuadrant / EDB products? I will be happy to accept a pull request
with these enhancements regardless of how the story with this proposal
ends up.

> Quite so. To some extent it's a toy. But at least one of our customers
> has found it useful, and judging by Aleksander's email they aren't
> alone.

Indeed, this is an extremely simple extension, ~500 effective lines of
code in C. It addresses a somewhat specific scenario, which, to my
regret, doesn't seem to be uncommon. A painkiller of sorts. In an
ideal world, people would simply normalize their data.

-- 
Best regards,
Aleksander Alekseev




Re: Fix typo: multiple tuple => tuples

2021-05-26 Thread Michael Paquier
On Wed, May 26, 2021 at 07:37:15AM +, houzj.f...@fujitsu.com wrote:
> I found a possible typo in the code comments of heap_multi_insert.
> 
> - *   heap_multi_insert   - insert multiple tuple into a heap
> + *   heap_multi_insert   - insert multiple tuples into a heap
> 
> Attaching a patch to fix it.

Thanks, fixed.
--
Michael


signature.asc
Description: PGP signature


Re: Parallel Inserts in CREATE TABLE AS

2021-05-26 Thread Bharath Rupireddy
On Tue, May 25, 2021 at 12:05 PM houzj.f...@fujitsu.com
 wrote:
>
> I noticed one place which could be one of the reasons that cause the 
> performance degradation.
>
> +   /*
> +* We don't need to skip contacting FSM while inserting 
> tuples for
> +* parallel mode, while extending the relations, workers 
> instead of
> +* blocking on a page while another worker is inserting, can 
> check the
> +* FSM for another page that can accommodate the tuples. This 
> results
> +* in major benefit for parallel inserts.
> +*/
> +   myState->ti_options = 0;
>
> I am not quite sure that disabling the " SKIP FSM " in parallel worker will 
> bring performance gain.
> In my test environment, if I change this code to use option " 
> TABLE_INSERT_SKIP_FSM ", then there
> seems no performance degradation . Could you please have a try on it ?
> (I test with the SQL you provided earlier[1])

Thanks for trying that out.

Please see the code around the use_fsm flag in
RelationGetBufferForTuple for more understanding of the points below.

What happens if FSM is skipped i.e. myState->ti_options =
TABLE_INSERT_SKIP_FSM;?
1) The flag use_fsm will be false in heap_insert->RelationGetBufferForTuple.
2) Each worker initially gets a block and keeps inserting into it
until it is full. When the block is full, the worker doesn't look in
FSM GetPageWithFreeSpace as use_fsm is false. It directly goes for
relation extension and tries to acquire relation extension lock with
LockRelationForExtension. Note that the bulk extension of blocks with
RelationAddExtraBlocks is not reached as use_fsm is false.
3) After acquiring the relation extension lock, it adds an extra new
block with ReadBufferBI(relation, P_NEW, ...), see the comment "In
addition to whatever extension we performed above, we always add at
least one block to satisfy our own request." The tuple is inserted
into this new block.

Basically, the workers can't look for empty pages among the pages
added by other workers; they keep doing the above steps in silos.

What happens if FSM is not skipped i.e. myState->ti_options = 0;?
1) The flag use_fsm will be true in heap_insert->RelationGetBufferForTuple.
2) Each worker initially gets a block and keeps inserting into it
until it is full. When the block is full, the worker looks for the
page with free space in FSM GetPageWithFreeSpace as use_fsm is true.
If it can't find any page with the required amount of free space, it
goes for bulk relation extension(RelationAddExtraBlocks) after
acquiring relation extension lock with
ConditionalLockRelationForExtension. Then the worker adds extraBlocks
= Min(512, lockWaiters * 20); new blocks in RelationAddExtraBlocks and
immediately updates the bottom level of FSM for each block (see the
comment around RecordPageWithFreeSpace for why only the bottom level,
not the entire FSM tree). After all the blocks are added, then it
updates the entire FSM tree FreeSpaceMapVacuumRange.
3) After the bulk extension, the worker adds another block (see
the comment "In addition to whatever extension we performed above, we
always add at least one block to satisfy our own request.") and inserts
the tuple into this new block.

Basically, the workers can benefit from the bulk extension of the
relation, and they can always look for empty pages among the pages
added by other workers. There are high chances that the blocks will be
available after bulk extension. Having said that, if the added extra
blocks are consumed by the workers very quickly, i.e. if the tuples are
big and there are very few tuples per page, then the bulk extension too
can't help much and there will be more contention on the relation extension
lock. Well, one might think to add more blocks at a time, say
Min(1024, lockWaiters * 128/256/512) than currently extraBlocks =
Min(512, lockWaiters * 20);. This will work (i.e. we don't see any
regression with parallel inserts in CTAS patches), but it can't be a
practical solution. Because the total pages for the relation will be
more with many pages having more free space. Furthermore, the future
sequential scans on that relation might take a lot of time.
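
As a way to observe the difference, one can look at what ends up recorded in
the FSM once the CTAS finishes; a minimal sketch using the contrib
pg_freespacemap extension (the table shape and row count here are
illustrative, not taken from the benchmark scripts):

CREATE EXTENSION IF NOT EXISTS pg_freespacemap;
CREATE TABLE ctas_target AS
    SELECT i AS a, i AS b, repeat('x', 60)::name AS c
    FROM generate_series(1, 100000) AS i;
-- Pages with any free space recorded in the FSM; no rows with avail > 0
-- simply means no FSM entries were recorded for this relation.
SELECT count(*) FROM pg_freespace('ctas_target') WHERE avail > 0;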

If myState->ti_options = TABLE_INSERT_SKIP_FSM; is set in only the one
place (within if (myState->is_parallel)), then it is effective only for the
leader, i.e. the leader will not look at the FSM, but all the workers will,
because within if (myState->is_parallel_worker) in intorel_startup,
myState->ti_options = 0; for the workers.

I ran tests with configuration shown at [1] for the case 4 (2
bigint(of 8 bytes each) columns, 16 name(of 64 bytes each) columns,
tuple size 1064 bytes, 10mn tuples) with leader participation where
I'm seeing regression:

1) when myState->ti_options = TABLE_INSERT_SKIP_FSM; for both leader
and workers, then my results are as follows:
0 workers - 116934.137, 2 workers - 209802.060, 4 workers - 248580.275
2) when myState->ti_options = 0; for both leade

Re: Parallel Inserts in CREATE TABLE AS

2021-05-26 Thread Bharath Rupireddy
On Tue, May 25, 2021 at 1:10 PM tsunakawa.ta...@fujitsu.com
 wrote:
>
> Although this may be a controversial and possibly crazy idea, the following 
> change brought 4-11% speedup.  This is because I thought parallel workers 
> might contend for WAL flush as a result of them using the limited ring buffer 
> and flushing dirty buffers when the ring buffer is filled.  Can we take 
> advantage of this?
>
> [GetBulkInsertState]
> /*  bistate->strategy = GetAccessStrategy(BAS_BULKWRITE);*/
> bistate->strategy = NULL;

You are right. If the ring buffer (16MB) is not used and shared
buffers (1GB) are used instead, then in your case, since the table size is
335MB and it can fit in shared buffers, there will be no or
very minimal dirty buffer flushing, so there will be some
more speedup.

Otherwise, a similar speedup can be observed when the BAS_BULKWRITE ring
size is increased a bit from the current 16MB to some other reasonable
value. I tried these experiments earlier.

Otherwise, as I said in [1], we can also increase the number of extra
blocks added at a time, say Min(1024, lockWaiters * 128/256/512) instead of
the current extraBlocks = Min(512, lockWaiters * 20);. This will also
give some speedup, and we don't see any regression with the parallel
inserts in CTAS patches.

But I'm not so sure that the hackers will accept any of the above as a
practical solution to the "relation extension" problem.

[1] 
https://www.postgresql.org/message-id/CALj2ACVdcrjwHXwvJqT-Fa32vnJEOjteep_3L24X8MK50E7M8w%40mail.gmail.com

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: Parallel Inserts in CREATE TABLE AS

2021-05-26 Thread Bharath Rupireddy
On Tue, May 25, 2021 at 1:50 PM tsunakawa.ta...@fujitsu.com
 wrote:
>
> From: houzj.f...@fujitsu.com 
> > + /*
> > +  * We don't need to skip contacting FSM while inserting tuples
> > for
> > +  * parallel mode, while extending the relations, workers
> > instead of
> > +  * blocking on a page while another worker is inserting, can
> > check the
> > +  * FSM for another page that can accommodate the tuples.
> > This results
> > +  * in major benefit for parallel inserts.
> > +  */
> > + myState->ti_options = 0;
> >
> > I am not quite sure that disabling the " SKIP FSM " in parallel worker will 
> > bring
> > performance gain.
> > In my test environment, if I change this code to use option "
> > TABLE_INSERT_SKIP_FSM ", then there
> > seems no performance degradation.
>
> +1, probably.

I tried to explain it at [1]. Please have a look.

> Does the code comment represent the situation like this?
>
> 1. Worker 1 is inserting into page 1.
>
> 2. Worker 2 tries to insert into page 1, but cannot acquire the buffer 
> content lock of page 1 because worker 1 holds it.
>
> 3. Worker 2 looks up FSM to find a page with enough free space.

I tried to explain it at [1]. Please have a look.

> But isn't FSM still empty during CTAS?

No, the FSM will be built on the fly if we don't skip the FSM, i.e.
myState->ti_options = 0; see RelationGetBufferForTuple with use_fsm =
true -> GetPageWithFreeSpace -> fsm_search -> fsm_set_and_search ->
fsm_readbuf with extend = true.

[1] 
https://www.postgresql.org/message-id/CALj2ACVdcrjwHXwvJqT-Fa32vnJEOjteep_3L24X8MK50E7M8w%40mail.gmail.com

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: Assertion failure while streaming toasted data

2021-05-26 Thread Amit Kapila
On Wed, May 26, 2021 at 1:47 PM Dilip Kumar  wrote:
>
> On Wed, May 26, 2021 at 11:53 AM Dilip Kumar  wrote:
> >
> > On Wed, May 26, 2021 at 11:19 AM Amit Kapila  
> > wrote:
> > >
>
> >
> > > I searched and didn't find any similar existing tests. Can we think of
> > > any other way to test this code path? We already have one copy test in
> > > toast.sql, isn't it possible to write a similar test here?
> >
> > I will check that and let you know.
>
> I have followed this approach to write the test.
>

The changes look good to me. I have made minor modifications in the
comments, see attached. Let me know what you think.

-- 
With Regards,
Amit Kapila.


v4-0001-Fix-assertion-during-streaming-of-multi-insert-to.patch
Description: Binary data


Re: Parallel Inserts in CREATE TABLE AS

2021-05-26 Thread Bharath Rupireddy
On Fri, May 21, 2021 at 3:46 PM Amit Kapila  wrote:
>
> On Fri, Mar 19, 2021 at 11:02 AM Bharath Rupireddy
>  wrote:
> >
> > On Wed, Jan 27, 2021 at 1:47 PM Bharath Rupireddy
> >  wrote:
> > >
> >
> > I analyzed performance of parallel inserts in CTAS for different cases
> > with tuple size 32bytes, 59bytes, 241bytes and 1064bytes. We could
> > gain if the tuple sizes are lower. But if the tuple size is larger
> > i..e 1064bytes, there's a regression with parallel inserts. Upon
> > further analysis, it turned out that the parallel workers are
> > requiring frequent extra blocks addition while concurrently extending
> > the relation(in RelationAddExtraBlocks) and the majority of the time
> > spent is going into flushing those new empty pages/blocks onto the
> > disk.
> >
>
> How you have ensured that the cost is due to the flushing of pages?

I think I was wrong to just say the problem is with the flushing of
empty pages when bulk extending the relation. I should have said the
problem is with the "relation extension lock", but I will hold off on
that claim for a moment until I capture the relation extension lock wait
events for the regression-causing cases. I will share the information
soon.

> AFAICS, we don't flush the pages rather just write them and then
> register those to be flushed by checkpointer, now it is possible that
> the checkpointer sync queue gets full and the backend has to write by
> itself but have we checked that? I think we can check via wait events,
> if it is due to flush then we should see a lot of file sync
> (WAIT_EVENT_DATA_FILE_SYNC) wait events.

I will also capture the data file sync events along with relation
extension lock wait events.
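
As a reference point, such wait events can be sampled from another session
while the CTAS is running with something like this (just ad-hoc sampling of
pg_stat_activity, not the actual instrumentation):

SELECT wait_event_type, wait_event, count(*)
FROM pg_stat_activity
WHERE wait_event IS NOT NULL
GROUP BY 1, 2
ORDER BY count(*) DESC;

Relation extension lock waits show up as Lock/extend, and the data file syncs
as IO/DataFileSync.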

> The other possibility could
> be that the free pages added to FSM by one worker are not being used
> by another worker due to some reason. Can we debug and check if the
> pages added by one worker are being used by another worker?

I tried to explain it at [1]. Please have a look. It looks like the
burden is more on the "relation extension lock" and the way the extra
new blocks are getting added.

[1] 
https://www.postgresql.org/message-id/CALj2ACVdcrjwHXwvJqT-Fa32vnJEOjteep_3L24X8MK50E7M8w%40mail.gmail.com

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: Parallel Inserts in CREATE TABLE AS

2021-05-26 Thread Amit Kapila
On Wed, May 26, 2021 at 5:28 PM Bharath Rupireddy
 wrote:
>
> On Fri, May 21, 2021 at 3:46 PM Amit Kapila  wrote:
> >
> > On Fri, Mar 19, 2021 at 11:02 AM Bharath Rupireddy
> >  wrote:
> > >
>
> > The other possibility could
> > be that the free pages added to FSM by one worker are not being used
> > by another worker due to some reason. Can we debug and check if the
> > pages added by one worker are being used by another worker?
>
> I tried to explain it at [1]. Please have a look.
>

I have read it, but I think we should try to verify what is actually
happening, because it is possible that the first time the worker checked the
FSM without taking the relation extension lock, it didn't find any free page,
and then when it tried to acquire the conditional lock, it got it
and just extended the relation by one block. So, in such a case
it won't be able to use the pages newly added by another worker. I am
not sure any such thing is happening here, but I think it is better to
verify it in some way. Also, I am not sure if just getting the info
about the relation extension lock is sufficient.

--
With Regards,
Amit Kapila.




Re: Assertion failure while streaming toasted data

2021-05-26 Thread Dilip Kumar
On Wed, May 26, 2021 at 5:28 PM Amit Kapila  wrote:
>
> On Wed, May 26, 2021 at 1:47 PM Dilip Kumar  wrote:
> >
> > On Wed, May 26, 2021 at 11:53 AM Dilip Kumar  wrote:
> > >
> > > On Wed, May 26, 2021 at 11:19 AM Amit Kapila  
> > > wrote:
> > > >
> >
> > >
> > > > I searched and didn't find any similar existing tests. Can we think of
> > > > any other way to test this code path? We already have one copy test in
> > > > toast.sql, isn't it possible to write a similar test here?
> > >
> > > I will check that and let you know.
> >
> > I have followed this approach to write the test.
> >
>
> The changes look good to me. I have made minor modifications in the
> comments, see attached. Let me know what do you think?

Your modification looks good to me.  Thanks!

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Fdw batch insert error out when set batch_size > 65535

2021-05-26 Thread Tomas Vondra

On 5/26/21 8:57 AM, Bharath Rupireddy wrote:

On Tue, May 25, 2021 at 2:47 PM Bharath Rupireddy
 wrote:


On Tue, May 25, 2021 at 1:08 PM houzj.f...@fujitsu.com
 wrote:

Thanks for the comments. I have addressed all comments to the v3 patch.


Thanks! The patch basically looks good to me except that it is missing
a commit message. I think it can be added now.


With v3 patch, I observed failure in postgres_fdw test cases with
insert query in prepared statements. Root cause is that in
postgresGetForeignModifyBatchSize, fmstate can be null (see the
existing code which handles for fmstate null cases). I fixed this, and
added a commit message. PSA v4 patch.



Thanks. In what situation is the fmstate NULL? If it is NULL, the 
current code simply skips the line adjusting it. Doesn't that mean we 
may not actually fix the bug in that case?


Also, I think it'd be good to keep the existing comment, probably as the first 
line of the new comment block.



regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Fdw batch insert error out when set batch_size > 65535

2021-05-26 Thread Bharath Rupireddy
On Wed, May 26, 2021 at 6:36 PM Tomas Vondra
 wrote:
>
> On 5/26/21 8:57 AM, Bharath Rupireddy wrote:
> > On Tue, May 25, 2021 at 2:47 PM Bharath Rupireddy
> >  wrote:
> >>
> >> On Tue, May 25, 2021 at 1:08 PM houzj.f...@fujitsu.com
> >>  wrote:
> >>> Thanks for the comments. I have addressed all comments to the v3 patch.
> >>
> >> Thanks! The patch basically looks good to me except that it is missing
> >> a commit message. I think it can be added now.
> >
> > With v3 patch, I observed failure in postgres_fdw test cases with
> > insert query in prepared statements. Root cause is that in
> > postgresGetForeignModifyBatchSize, fmstate can be null (see the
> > existing code which handles for fmstate null cases). I fixed this, and
> > added a commit message. PSA v4 patch.
> >
>
> Thanks. In what situation is the fmstate NULL? If it is NULL, the
> current code simply skips the line adjusting it. Doesn't that mean we
> may not actually fix the bug in that case?

fmstate, i.e. resultRelInfo->ri_FdwState, is NULL for EXPLAIN without
ANALYZE cases; the comment below says so, and we can't hit the bug there
because we don't actually execute the insert statement. The bug occurs on
the remote server when the insert query with that many query parameters
is submitted to it. I'm not sure if there are any other
cases where it can be NULL.
/*
 * In EXPLAIN without ANALYZE, ri_fdwstate is NULL, so we have to lookup
 * the option directly in server/table options. Otherwise just use the
 * value we determined earlier.
 */
if (fmstate)
    batch_size = fmstate->batch_size;
else
    batch_size = get_batch_size_option(resultRelInfo->ri_RelationDesc);
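
For context, a minimal sketch of the kind of setup that runs into the 65535
query-parameter limit being adjusted here (the server, table, and option
values are illustrative, not taken from the regression tests):

CREATE EXTENSION IF NOT EXISTS postgres_fdw;
CREATE SERVER loopback FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (dbname 'postgres', batch_size '100000');
CREATE USER MAPPING FOR CURRENT_USER SERVER loopback;
CREATE TABLE batch_target (a int, b int);
CREATE FOREIGN TABLE batch_source (a int, b int)
    SERVER loopback OPTIONS (table_name 'batch_target');
-- Two columns times a 100000-row batch would need 200000 bind parameters,
-- more than the protocol allows, so without the fix the remote server
-- rejects the batched insert.
INSERT INTO batch_source SELECT i, i FROM generate_series(1, 100000) i;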

> Also, I think it'd be keep the existing comment, probably as the first
> line of the new comment block.

Do you mean to say we need to retain "/* Otherwise use the batch size
specified for server/table. */"? If so, PSA v5.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com


v5-0001-Adjust-batch_size-to-not-exceed-max-param-limit-o.patch
Description: Binary data


Re: CREATE COLLATION - check for duplicate options and error out if found one

2021-05-26 Thread vignesh C
On Tue, Apr 27, 2021 at 3:21 PM Bharath Rupireddy
 wrote:
>
> Hi,
>
> While reviewing [1], I found that the CREATE COLLATION doesn't throw an error 
> if duplicate options are specified, see [2] for testing a few cases on HEAD. 
> This may end up accepting some of the weird cases, see [2]. It's against 
> other option checking code in the server where the duplicate option is 
> detected and an error thrown if found one.  Attached a patch doing that. I 
> chose to have the error message "option \"%s\" specified more than once" and 
> parser_errposition because that's kind of agreed in [3].
>
> Thoughts?
>
> [1] 
> https://www.postgresql.org/message-id/CALj2ACWVd%3D-E6uG5AdHD0MvHY6e4mVzkapT%3DvLDnJJseGjaJLQ%40mail.gmail.com
>
> [2]
> CREATE COLLATION coll_dup_chk (LC_COLLATE = "POSIX", LC_COLLATE = "NONSENSE", 
> LC_CTYPE = "POSIX"); -- ERROR
> CREATE COLLATION coll_dup_chk (LC_COLLATE = "NONSENSE", LC_COLLATE = "POSIX", 
> LC_CTYPE = "POSIX"); -- OK but it's weird
> CREATE COLLATION coll_dup_chk (LC_CTYPE = "POSIX", LC_CTYPE = "NONSENSE", 
> LC_COLLATE = "POSIX"); -- ERROR
> CREATE COLLATION coll_dup_chk (LC_CTYPE = "NONSENSE", LC_CTYPE = "POSIX", 
> LC_COLLATE = "POSIX",); -- OK but it's weird
> CREATE COLLATION coll_dup_chk (PROVIDER = icu, PROVIDER = NONSENSE, 
> LC_COLLATE = "POSIX", LC_CTYPE = "POSIX"); -- ERROR
> CREATE COLLATION coll_dup_chk (PROVIDER = NONSENSE, PROVIDER = icu, 
> LC_COLLATE = "POSIX", LC_CTYPE = "POSIX"); -- OK but it's weird
> CREATE COLLATION case_sensitive (LOCALE = '', LOCALE = "NONSENSE"); -- ERROR
> CREATE COLLATION coll_dup_chk (LOCALE = "NONSENSE", LOCALE = ''); -- OK but 
> it's weird
> CREATE COLLATION coll_dup_chk (DETERMINISTIC = TRUE, DETERMINISTIC = 
> NONSENSE, LOCALE = ''); -- ERROR
> CREATE COLLATION coll_dup_chk (DETERMINISTIC = NONSENSE, DETERMINISTIC = 
> TRUE, LOCALE = ''); -- OK but it's weird
>
> [3] 
> https://www.postgresql.org/message-id/CALj2ACUa%3DZM8QtOLPCHc7%3DWgFrx9P6-AgKQs8cmKLvNCvu7arQ%40mail.gmail.com

+1 for fixing this issue; we have handled this error in other places.
The patch does not apply on HEAD; could you rebase the patch on HEAD
and post a new version?

Regards,
Vignesh




Re: Logical Replication - improve error message while adding tables to the publication in check_publication_add_relation

2021-05-26 Thread vignesh C
On Mon, Apr 5, 2021 at 7:19 PM Bharath Rupireddy
 wrote:
>
> On Mon, Apr 5, 2021 at 6:41 PM Euler Taveira  wrote:
> > Here's the v4 patch reabsed on the latest master, please review it further.
> >
> > /* UNLOGGED and TEMP relations cannot be part of publication. */
> > if (!RelationIsPermanent(targetrel))
> > - ereport(ERROR,
> > - (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> > - errmsg("table \"%s\" cannot be replicated",
> > - RelationGetRelationName(targetrel)),
> > - errdetail("Temporary and unlogged relations cannot be replicated.")));
> > + {
> > + if (RelationUsesLocalBuffers(targetrel))
> > + ereport(ERROR,
> > + (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> > + errmsg("\"%s\" is a temporary table",
> > + RelationGetRelationName(targetrel)),
> > + errdetail("Temporary tables cannot be added to publications.")));
> > + else if (targetrel->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED)
> > + ereport(ERROR,
> > + (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> > + errmsg("\"%s\" is an unlogged table",
> > + RelationGetRelationName(targetrel)),
> > + errdetail("Unlogged tables cannot be added to publications.")));
> > + }
> >
> > RelationIsPermanent(), RelationUsesLocalBuffers(), and
> > targetrel->rd_rel->relpersistence all refers to relpersistence. Hence, it is
> > not necessary to test !RelationIsPermanent().
>
> Done.
>
> > I would slightly rewrite the commit message to something like:
> >
> > Improve publication error messages
> >
> > Adding a foreign table into a publication prints an error saying "foo is 
> > not a
> > table". Although, a foreign table is not a regular table, this message could
> > possibly confuse users. Provide a suitable error message according to the
> > object class (table vs foreign table). While at it, separate unlogged/temp
> > table error message into 2 messages.
>
> Thanks for the better wording.
>
> Attaching v5 patch, please have a look.
>

We get the following error while adding an index:
create publication mypub for table idx_t1;
ERROR:  "idx_t1" is an index

This error comes internally from the table_openrv function call; if we
could replace it with relation_openrv and then check the relation kind,
we could throw a similar error message here too, like the other changes
in the patch.

Regards,
Vignesh




Re: CREATE COLLATION - check for duplicate options and error out if found one

2021-05-26 Thread Bharath Rupireddy
On Wed, May 26, 2021 at 7:18 PM vignesh C  wrote:
> +1 for fixing this issue, we have handled this error in other places.
> The patch does not apply on head, could you rebase the patch on head
> and post a new patch.

Thanks. I thought of rebasing once the other patch (which reorganizes the
"...specified more than once" error) gets committed. Anyway, I've
rebased it for now on the latest master. Please review the v2 patch.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com


v2-0001-Check-duplicate-options-and-error-out-for-CREATE-.patch
Description: Binary data


Re: Logical Replication - improve error message while adding tables to the publication in check_publication_add_relation

2021-05-26 Thread Bharath Rupireddy
On Wed, May 26, 2021 at 7:38 PM vignesh C  wrote:
> > Attaching v5 patch, please have a look.
>
> We get the following error while adding an index:
> create publication mypub for table idx_t1;
> ERROR:  "idx_t1" is an index
>
> This error occurs internally from table_openrv function call, if we
> could replace this with relation_openrv and then check the table kind,
> we could throw a similar error message here too like the other changes
> in the patch.

Are you saying that we should replace table_open in publication_add_relation with
relation_open and have the "\"%s\" is an index" or "\"%s\" is a
composite type" checks in check_publication_add_relation? If that is
so, I don't think it's a good idea to have the extra code in
check_publication_add_relation and I would like it to be the way it is
currently.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: pg_rewind fails if there is a read only file.

2021-05-26 Thread Andrew Dunstan


On 5/26/21 2:43 AM, Laurenz Albe wrote:
> On Wed, 2021-05-26 at 08:57 +0900, Michael Paquier wrote:
>> On Tue, May 25, 2021 at 03:17:37PM -0400, Andrew Dunstan wrote:
>>> If we do decide to do something the question arises what should it do?
>>> If we're to allow for it I'm wondering if the best thing would be simply
>>> to ignore such a file.
>> Enforcing assumptions that any file could be ready-only is a very bad
>> idea, as that could lead to weird behaviors if a FS is turned as
>> becoming read-only suddenly while doing a rewind.  Another idea that
>> has popped out across the years was to add an option to pg_rewind so
>> as users could filter files manually.  That could be easily dangerous
>> though in the wrong hands, as one could think that it is a good idea
>> to skip a control file, for example.
>>
>> The thing is that here we actually know the set of files we'd like to
>> ignore most of the time, and we still want to have some automated
>> control what gets filtered.  So here is a new idea: we build a list of
>> files based on a set of GUC parameters using postgres -C on the target
>> data folder, and assume that these are safe enough to be skipped all
>> the time, if these are in the data folder.
> That sounds complicated, but should work.
> There should be a code comment somewhere that warns people not to forget
> to look at that when they add a new GUC.
>
> I can think of two alternatives to handle this:
>
> - Skip files that cannot be opened for writing and issue a warning.
>   That is simple, but coarse.
>   A slightly more sophisticated version would first check if files
>   are the same on both machines and skip the warning for those.
>
> - Paul's idea to try and change the mode on the read-only file
>   and reset it to the original state after pg_rewind is done.
>

How about we only skip (with a warning) read-only files if they are in
the data root, not a subdirectory? That would save us from silently
ignoring read-only filesystems and not involve using a GUC.
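
A minimal sketch of that check (a hand-written illustration only; the function
name and warning handling are assumptions, not pg_rewind code):

#include <stdbool.h>
#include <string.h>
#include <unistd.h>

/*
 * Skip a target-side file only if it lives directly in the data directory
 * root (no path separator in its datadir-relative name) and cannot be
 * opened for writing; anything in a subdirectory is never skipped, so a
 * read-only filesystem still fails loudly.
 */
static bool
skip_readonly_in_dataroot(const char *rel_path, const char *full_path)
{
	if (strchr(rel_path, '/') != NULL)
		return false;			/* in a subdirectory: never skip */

	return access(full_path, W_OK) != 0;	/* read-only file in the data root */
}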


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Next Commitfest Manager.

2021-05-26 Thread Ibrar Ahmed
On Thu, Feb 4, 2021 at 1:13 AM Stephen Frost  wrote:

> Greetings,
>
> * Ibrar Ahmed (ibrar.ah...@gmail.com) wrote:
> > Anyone else already volunteers that? It is my first time so need some
> > access, if all agree.
>
> Thanks for volunteering!
>
> That said, our last commitfest tends to be the most difficult as it's
> the last opportunity for features to land in time for the next major
> release and, for my part at least, I think it'd be best to have
> someone who has experience running a CF previously manage it.
>
> To that end, I've talked to David Steele, who has run this last CF for
> the past few years and we're in agreement that he's willing to run this
> CF again this year, assuming there's no objections.  What we've thought
> to suggest is that you follow along with David as he runs this CF and
> then offer to run the July CF.  Of course, we would encourage you and
> David to communicate and for you to ask David any questions you have
> about how he handles things as part of the CF.  This is in line with how
> other CF managers have started out also.
>

As we know, the July commitfest is coming; I already volunteered to manage
that. I did some small work with David in the last commitfest and am now
ready to work on that.


>
> Open to your thoughts, as well as those of anyone else who wishes to
> comment.
>
> Thanks!
>
> Stephen
>


-- 
Ibrar Ahmed


Re: Replacing pg_depend PIN entries with a fixed range check

2021-05-26 Thread Robert Haas
On Wed, May 12, 2021 at 6:21 PM Tom Lane  wrote:
> Here's a v2 that does things that way (and is rebased up to HEAD).
> I did some more documentation cleanup, too.

The first hunk of the patch seems to back away from the idea that the
cutoff is 13000, but the second half of the patch says 13000 still
matters. Not sure I understand what's going on there exactly.

I suggest deleting the words "An additional thing that is useful to
know is that" because the rest of the sentence is fine without it.

I'm sort of wondering what we think the long term plan ought to be.
Are there some categories of things we should be looking to move out
of the reserved OID space to keep it from filling up? Can we
realistically think of moving the 16384 boundary?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Add ZSON extension to /contrib/

2021-05-26 Thread Konstantin Knizhnik




On 25.05.2021 13:55, Aleksander Alekseev wrote:

Hi hackers,

Back in 2016 while being at PostgresPro I developed the ZSON extension 
[1]. The extension introduces the new ZSON type, which is 100% 
compatible with JSONB but uses a shared dictionary of strings most 
frequently used in given JSONB documents for compression. These 
strings are replaced with integer IDs. Afterward, PGLZ (and now LZ4) 
applies if the document is large enough by common PostgreSQL logic. 
Under certain conditions (many large documents), this saves disk 
space, memory and increases the overall performance. More details can 
be found in README on GitHub.


The extension was accepted warmly and instantaneously I got several 
requests to submit it to /contrib/ so people using Amazon RDS and 
similar services could enjoy it too. Back then I was not sure if the 
extension is mature enough and if it lacks any additional features 
required to solve the real-world problems of the users. Time showed, 
however, that people are happy with the extension as it is. There were 
several minor issues discovered, but they were fixed back in 2017. The 
extension never experienced any compatibility problems with the next 
major release of PostgreSQL.


So my question is if the community may consider adding ZSON to 
/contrib/. If this is the case I will add this thread to the nearest 
CF and submit a corresponding patch.


[1]: https://github.com/postgrespro/zson

--
Best regards,
Aleksander Alekseev
Open-Source PostgreSQL Contributor at Timescale



Yet another approach to the same problem:

https://github.com/postgrespro/jsonb_schema

Instead of compressing JSONs we can try to automatically detect the JSON 
schema (names and types of JSON fields) and store it separately from the values.
This approach is more similar to the one used in schema-less databases. It 
is most efficient if there are many JSON records with the same schema 
and the sizes of keys are comparable with the sizes of values. On the IMDB 
data set it reduces the database size by about 1.7 times.







Re: Move pg_attribute.attcompression to earlier in struct for reduced size?

2021-05-26 Thread Tom Lane
Michael Paquier  writes:
> Ah, the parallel with reltablespace and default_tablespace at database
> level is a very good point.  It is true that currently the code would
> assign attcompression to a non-zero value once the relation is defined
> depending on default_toast_compression set for the database, but
> setting it to 0 in this case would be really helpful to change the
> compression methods of all the relations if doing something as crazy
> as a VACUUM FULL for this database.  Count me as convinced.

Here's a draft patch series to address this.

0001 removes the relkind checks I was questioning originally.
As expected, this results in zero changes in check-world results.

0002 is the main change in the semantics of attcompression.
This does change the results of compression.sql, but in what
seem to me to be expected ways: a column's compression option
is now shown in \d+ output only if you explicitly set it.

0003 further removes pg_dump's special handling of
default_toast_compression.  I don't think we need that anymore.
AFAICS its only effect would be to override the receiving server's
default_toast_compression setting for dumped/re-loaded data, which
does not seem like a behavior that anyone would want.

Loose ends:

* I've not reviewed the docs fully; there are likely some more
things that need updated.

* As things stand here, once you've applied ALTER ... SET COMPRESSION
to select a specific method, there is no way to undo that and go
back to the use-the-default setting.  All you can do is change to
explicitly select the other method.  Should we invent "ALTER ...
SET COMPRESSION default" or the like to cover that?  (Since
DEFAULT is a reserved word, that exact syntax might be a bit of
a pain to implement, but maybe we could think of another word.)

* I find GetDefaultToastCompression() annoying.  I do not think
it is project style to invent trivial wrapper functions around
GUC variable references: it buys nothing while requiring readers
to remember one more name than they would otherwise.  Since there
are only two uses remaining, maybe this isn't very important either
way, but I'm still inclined to flush it.

Comments?

regards, tom lane

diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 11e91c4ad3..69b1839fd7 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -897,17 +897,8 @@ DefineRelation(CreateStmt *stmt, char relkind, Oid ownerId,
 		if (colDef->generated)
 			attr->attgenerated = colDef->generated;
 
-		/*
-		 * lookup attribute's compression method and store it in the
-		 * attr->attcompression.
-		 */
-		if (relkind == RELKIND_RELATION ||
-			relkind == RELKIND_PARTITIONED_TABLE ||
-			relkind == RELKIND_MATVIEW)
-			attr->attcompression =
-GetAttributeCompression(attr, colDef->compression);
-		else
-			attr->attcompression = InvalidCompressionMethod;
+		attr->attcompression = GetAttributeCompression(attr,
+	   colDef->compression);
 	}
 
 	/*
@@ -6602,13 +6593,8 @@ ATExecAddColumn(List **wqueue, AlteredTableInfo *tab, Relation rel,
 	attribute.attbyval = tform->typbyval;
 	attribute.attalign = tform->typalign;
 	attribute.attstorage = tform->typstorage;
-	/* do not set compression in views etc */
-	if (rel->rd_rel->relkind == RELKIND_RELATION ||
-		rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
-		attribute.attcompression = GetAttributeCompression(&attribute,
-		   colDef->compression);
-	else
-		attribute.attcompression = InvalidCompressionMethod;
+	attribute.attcompression = GetAttributeCompression(&attribute,
+	   colDef->compression);
 	attribute.attnotnull = colDef->is_not_null;
 	attribute.atthasdef = false;
 	attribute.atthasmissing = false;
@@ -12299,23 +12285,10 @@ ATExecAlterColumnType(AlteredTableInfo *tab, Relation rel,
 	attTup->attbyval = tform->typbyval;
 	attTup->attalign = tform->typalign;
 	attTup->attstorage = tform->typstorage;
-
-	/* Setup attribute compression */
-	if (rel->rd_rel->relkind == RELKIND_RELATION ||
-		rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
-	{
-		/*
-		 * No compression for plain/external storage, otherwise, default
-		 * compression method if it is not already set, refer comments atop
-		 * attcompression parameter in pg_attribute.h.
-		 */
-		if (!IsStorageCompressible(tform->typstorage))
-			attTup->attcompression = InvalidCompressionMethod;
-		else if (!CompressionMethodIsValid(attTup->attcompression))
-			attTup->attcompression = GetDefaultToastCompression();
-	}
-	else
+	if (!IsStorageCompressible(tform->typstorage))
 		attTup->attcompression = InvalidCompressionMethod;
+	else if (!CompressionMethodIsValid(attTup->attcompression))
+		attTup->attcompression = GetDefaultToastCompression();
 
 	ReleaseSysCache(typeTuple);
 
diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index b26b872a06..16493209c6 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@

Re: Move pg_attribute.attcompression to earlier in struct for reduced size?

2021-05-26 Thread Robert Haas
On Wed, May 26, 2021 at 11:13 AM Tom Lane  wrote:
> * As things stand here, once you've applied ALTER ... SET COMPRESSION
> to select a specific method, there is no way to undo that and go
> back to the use-the-default setting.  All you can do is change to
> explicitly select the other method.  Should we invent "ALTER ...
> SET COMPRESSION default" or the like to cover that?  (Since
> DEFAULT is a reserved word, that exact syntax might be a bit of
> a pain to implement, but maybe we could think of another word.)

Yes. Irreversible catalog changes are bad.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Replacing pg_depend PIN entries with a fixed range check

2021-05-26 Thread Tom Lane
Robert Haas  writes:
> The first hunk of the patch seems to back away from the idea that the
> cutoff is 13000, but the second half of the patch says 13000 still
> matters. Not sure I understand what's going on there exactly.

Not sure exactly what you're looking at, but IIRC there is a place
where the patch is cleaning up after ab596105b's failure to adjust
bki.sgml to match its change of FirstBootstrapObjectId from 12000
to 13000.  I hadn't bothered to fix that separately, but I guess
we should do so, else v14 is going to ship with incorrect docs.

> I'm sort of wondering what we think the long term plan ought to be.
> Are there some categories of things we should be looking to move out
> of the reserved OID space to keep it from filling up? Can we
> realistically think of moving the 16384 boundary?

I haven't got any wonderful ideas there.  I do not see how we can
move the 16384 boundary without breaking pg_upgrade'ing, because
pg_upgrade relies on preserving user object OIDs that are likely
to be not much above that value.  Probably, upping
FirstNormalObjectId ought to be high on our list of things to do
if we ever do force an on-disk compatibility break.  In the
meantime, we could decrease the 10000 boundary if things get
tight above that, but I fear that would annoy some extension
maintainers.

Another idea is to give up the principle that initdb-time OIDs
need to be globally unique.  They only really need to be
unique within their own catalogs, so we could buy a lot of space
by exploiting that.  The original reason for that policy was to
reduce the risk of mistakes in handwritten OID references in
the initial catalog data --- but now that numeric references
there are Not Done, it seems like we don't really need that.

An intermediate step, perhaps, could be to give up that
uniqueness only for OIDs assigned by genbki.pl itself, while
keeping it for OIDs below 10000.  This'd be appealing if we
find that we're getting tight between 10K and 13K.

In any case it doesn't seem like the issue is entirely pressing
yet.  Although ... maybe we should do that last bit now, so
that we can revert FirstBootstrapObjectId to 12K before v14
ships?  I've felt a little bit of worry that that change might
cause problems on machines with a boatload of locales.

regards, tom lane




Re: Race condition in recovery?

2021-05-26 Thread Robert Haas
On Wed, May 26, 2021 at 2:44 AM Dilip Kumar  wrote:
> I think we need to create some content on promoted standby and check
> whether the cascade standby is able to get that or not, that will
> guarantee that it is actually following the promoted standby,  I have
> added the test for that so that it matches the comments.

OK, but I ran this test against an unpatched server and it passed.
That's not so great, because the test should fail without the bug fix.
It seems to be because there's only one actual test in this test case.
Looking at the log file,
src/test/recovery/tmp_check/log/regress_log_025_timeline_issue, the
only "ok" nor "not ok" line is:

ok 1 - check streamed content on cascade standby

So either that test is not right or some other test is needed. I think
there's something else going wrong here, because when I run my
original test.sh script, I see this:

2021-05-26 11:37:47.794 EDT [57961] LOG:  restored log file
"0002.history" from archive
...
2021-05-26 11:37:47.916 EDT [57961] LOG:  redo starts at 0/228
...
2021-05-26 11:37:47.927 EDT [57966] LOG:  replication terminated by
primary server
2021-05-26 11:37:47.927 EDT [57966] DETAIL:  End of WAL reached on
timeline 1 at 0/300

But in the src/test/recovery/tmp_check/log/025_timeline_issue_cascade.log
file generated by this test case:

cp: 
/Users/rhaas/pgsql/src/test/recovery/tmp_check/t_025_timeline_issue_primary_data/archives/0002.history:
No such file or directory
...
2021-05-26 11:41:08.149 EDT [63347] LOG:  fetching timeline history
file for timeline 2 from primary server
...
2021-05-26 11:41:08.288 EDT [63344] LOG:  new target timeline is 2
...
2021-05-26 11:41:08.303 EDT [63344] LOG:  redo starts at 0/228
...
2021-05-26 11:41:08.331 EDT [63347] LOG:  restarted WAL streaming at
0/300 on timeline 2

So it doesn't seem like the test is actually reproducing the problem
correctly. The timeline history file isn't available from the archive,
so it streams it, and then the problem doesn't occur. I guess that's
because there's nothing to guarantee that the history file reaches the
archive before 'cascade' is started. The code just does:

# Promote the standby.
$node_standby->psql('postgres', 'SELECT pg_promote()');

# Start cascade node
$node_cascade->start;

...which has a clear race condition.
src/test/recovery/t/023_pitr_prepared_xact.pl has logic to wait for a
WAL file to be archived, so maybe we can steal that logic and use it
here.

I suggest we rename the test to something a bit more descriptive. Like
instead of 025_timeline_issue.pl, perhaps
025_stuck_on_old_timeline.pl? Or I'm open to other suggestions, but
"timeline issue" is a bit too vague for my taste.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Race condition in recovery?

2021-05-26 Thread Dilip Kumar
On Wed, May 26, 2021 at 9:40 PM Robert Haas  wrote:
>
> On Wed, May 26, 2021 at 2:44 AM Dilip Kumar  wrote:
> > I think we need to create some content on promoted standby and check
> > whether the cascade standby is able to get that or not, that will
> > guarantee that it is actually following the promoted standby,  I have
> > added the test for that so that it matches the comments.
>
> OK, but I ran this test against an unpatched server and it passed.
> That's not so great, because the test should fail without the bug fix.
> It seems to be because there's only one actual test in this test case.
> Looking at the log file,
> src/test/recovery/tmp_check/log/regress_log_025_timeline_issue, the
> only "ok" nor "not ok" line is:
>
> ok 1 - check streamed content on cascade standby
>
> So either that test is not right or some other test is needed. I think
> there's something else going wrong here, because when I run my
> original test.sh script, I see this:

That's strange; when I ran the test I can see the below in the log of the cascade
node (which shows that the cascade gets the history file but not the WAL
file, and then it selects the old timeline and never goes to the new
timeline)

...
2021-05-26 21:46:54.412 IST [84080] LOG:  restored log file
"0002.history" from archive
cp: cannot stat
‘/home/dilipkumar/work/PG/postgresql/src/test/recovery/tmp_check/t_025_timeline_issue_primary_data/archives/0003.history’:
No such file or directory
2021-05-26 21:46:54.415 IST [84080] LOG:  entering standby mode
2021-05-26 21:46:54.419 IST [84080] LOG:  restored log file
"0002.history" from archive
.
2021-05-26 21:46:54.429 IST [84085] LOG:  started streaming WAL from
primary at 0/200 on timeline 1   -> stream using previous TL
2021-05-26 21:46:54.466 IST [84080] LOG:  redo starts at 0/228
2021-05-26 21:46:54.466 IST [84080] LOG:  consistent recovery state
reached at 0/300
2021-05-26 21:46:54.467 IST [84079] LOG:  database system is ready to
accept read only connections
2021-05-26 21:46:54.483 IST [84085] LOG:  replication terminated by
primary server
2021-05-26 21:46:54.483 IST [84085] DETAIL:  End of WAL reached on
timeline 1 at 0/300.
cp: cannot stat
‘/home/dilipkumar/work/PG/postgresql/src/test/recovery/tmp_check/t_025_timeline_issue_primary_data/archives/0003.history’:
No such file or directory
cp: cannot stat
‘/home/dilipkumar/work/PG/postgresql/src/test/recovery/tmp_check/t_025_timeline_issue_primary_data/archives/00010003’:
No such file or directory
2021-05-26 21:46:54.498 IST [84085] LOG:  primary server contains no
more WAL on requested timeline 1



And finally the test case fails because the cascade can never get the changes.

I will check if there is any timing dependency in the test case.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Replacing pg_depend PIN entries with a fixed range check

2021-05-26 Thread Robert Haas
On Wed, May 26, 2021 at 11:37 AM Tom Lane  wrote:
> In any case it doesn't seem like the issue is entirely pressing
> yet.  Although ... maybe we should do that last bit now, so
> that we can revert FirstBootstrapObjectId to 12K before v14
> ships?  I've felt a little bit of worry that that change might
> cause problems on machines with a boatload of locales.

I think that particular case is definitely worth worrying about. Most
of what we put into the system catalogs is our own hand-crafted
entries, but that's coming from the operating system and we have no
control over it whatever. It wouldn't be very nice to have to suggest
to users who find they can't initdb that perhaps they should delete some
locales...

Honestly, it seems odd to me that these entries use reserved OIDs
rather than regular ones at all. Why does the first run of
pg_import_system_collations use special magic OIDs, and later runs use
regular OIDs? pg_type OIDs need to remain stable from release to
release since it's part of the on disk format for arrays, and pg_proc
OIDs have to be the same at compile time and initdb time because of
the fmgr hash table, and any other thing that has a constant that
might be used in the source code also has that issue. But none of this
applies to collations: they can't be expected to have the same OID from
release to release, or even from one installation to another; the
source code can't be relying on the specific values; and we have no
idea how many there might be.

So I think your proposal of allowing genbki-assigned OIDs to be reused
in different catalogs is probably a pretty good one, but I wonder if
we could just rejigger things so that collations just get normal OIDs
> 16384.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Race condition in recovery?

2021-05-26 Thread Robert Haas
On Wed, May 26, 2021 at 12:26 PM Dilip Kumar  wrote:
> I will check if there is any timing dependency in the test case.

There is. I explained it in the second part of my email, which you may
have failed to notice.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Race condition in recovery?

2021-05-26 Thread Dilip Kumar
On Wed, 26 May 2021 at 10:06 PM, Robert Haas  wrote:

> On Wed, May 26, 2021 at 12:26 PM Dilip Kumar 
> wrote:
> > I will check if there is any timing dependency in the test case.
>
> There is. I explained it in the second part of my email, which you may
> have failed to notice.


Sorry, my bad.  I got your point now.  I will change the test.

> --
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


Re: CALL versus procedures with output-only arguments

2021-05-26 Thread Peter Eisentraut

On 25.05.21 22:21, Tom Lane wrote:

Yeah, the odd behavior of CALL is where I started from, but now I think
the main problem is with the signature (ie, allowing procedures with
signatures that differ only in OUT parameter positions).  If we got
rid of that choice then it'd be possible to document that you should
only ever write NULL for OUT-parameter positions, because the type
of such an argument would never be significant for disambiguation.


AFAICT, your patch does not maintain the property that

CREATE PROCEDURE p1(OUT int, OUT int)

corresponds to

DROP PROCEDURE p1(int, int)

which would be bad.

I'm not opposed to reverting the feature if we can't find a good 
solution in a hurry.  The main value of this feature is for 
migrations, so I want to be sure that whatever we settle on doesn't back 
us into a corner with respect to that.


We could perhaps also just disable the SQL-level calling until a better 
solution arises.  AFAICT, things work okay in PL/pgSQL, because OUT 
parameters are tied to a typed target there.





Re: Add ZSON extension to /contrib/

2021-05-26 Thread Matthias van de Meent
On Wed, 26 May 2021 at 12:49, Aleksander Alekseev
 wrote:
>
> Hi hackers,
>
> Many thanks for your feedback, I very much appreciate it!
>
> > If the extension is mature enough, why make it an extension in
> > contrib, and not instead either enhance the existing jsonb type with
> > it or make it a built-in type?
>
> > IMO we have too d*mn many JSON types already.  If we can find a way
> > to shoehorn this optimization into JSONB, that'd be great.  Otherwise
> > I do not think it's worth the added user confusion.
>
> Magnus, Tom,
>
> My reasoning is that if the problem can be solved with an extension
> there is little reason to modify the core. This seems to be in the
> spirit of PostgreSQL. If the community reaches the consensus to modify
> the core to introduce a similar feature, we could discuss this as
> well. It sounds like a lot of unnecessary work to me though (see
> below).
>
> > * doesn't cover all cases, notably indexes.
>
> Tom,
>
> Not sure if I follow. What cases do you have in mind?
>
> > Do note that e.g. postgis is not in contrib, but is available in e.g. RDS.
>
> Matthias,
>
> Good point. I suspect that PostGIS is an exception though...

Quite a few other non-/common/ extensions are available in RDS[0],
some of which are HLL (from citusdata), pglogical (from 2ndQuadrant)
and orafce (from Pavel Stehule, orafce et al.).

> > I like the idea of the ZSON type, but I'm somewhat disappointed by its
> > current limitations
>
> Several people suggested various enhancements right after learning
> about ZSON. Time showed, however, that none of the real-world users
> really need e.g. more than one common dictionary per database. I
> suspect this is because no one has more than 2**16 repeatable unique
> strings (one dictionary limitation) in their documents. Thus there is
> no benefit in having separate dictionaries and corresponding extra
> complexity.

IMO the main benefit of having different dictionaries is that you
could have a small dictionary for small and very structured JSONB
fields (e.g. some time-series data), and a large one for large /
unstructured JSONB fields, without having the significant performance
impact of having that large and varied dictionary on the
small&structured field. Although a binary search is log(n) and thus
still quite cheap even for large dictionaries, the extra size is
certainly not free, and you'll be touching more memory in the process.
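
To make the mechanism being discussed concrete, the core of such a dictionary
is just a binary search over a sorted string table that maps hits to 16-bit
codes; here is a small self-contained sketch (not ZSON's actual code, and the
sample keys are made up):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Sorted table of frequent key strings; the array index doubles as the code. */
static const char *dict[] = {"address", "customer_id", "email", "name", "phone"};
#define DICT_SIZE (sizeof(dict) / sizeof(dict[0]))

static int
cmp(const void *key, const void *member)
{
	return strcmp((const char *) key, *(const char *const *) member);
}

/* Returns the 16-bit code for a known string, or -1 if it is not in the dictionary. */
static int32_t
dict_encode(const char *word)
{
	const char **hit = bsearch(word, dict, DICT_SIZE, sizeof(dict[0]), cmp);

	return hit ? (int32_t) (uint16_t) (hit - dict) : -1;
}

int
main(void)
{
	printf("%d %d\n", dict_encode("email"), dict_encode("unknown"));	/* prints: 2 -1 */
	return 0;
}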

> > - Each dictionary uses a lot of memory, regardless of the number of
> > actual stored keys. For 32-bit systems the base usage of a dictionary
> > without entries ((sizeof(Word) + sizeof(uint16)) * 2**16) would be
> > almost 1MB, and for 64-bit it would be 1.7MB. That is significantly
> > more than I'd want to install.
>
> You are probably right on this one, this part could be optimized. I
> will address this if we agree on submitting the patch.
>
> > - You call gettimeofday() in both dict_get and in get_current_dict_id.
> > These functions can be called in short and tight loops (for small ZSON
> > fields), in which case it would add significant overhead through the
> > implied syscalls.
>
> I must admit, I'm not an expert in this area. My understanding is that
> gettimeofday() is implemented as single virtual memory access on
> modern operating systems, e.g. VDSO on Linux, thus it's very cheap.
> I'm not that sure about other supported platforms though. Probably
> worth investigating.

Yes, but vDSO does not necessarily work on all systems: e.g. in 2017,
a lot on EC2 [1] was run using Xen with vDSO not working for
gettimeofday. I'm uncertain if this issue persists for their new
KVM/Nitro hypervisor.

> > It does mean that you're deTOASTing
> > the full ZSON field, and that the stored bytestring will not be
> > structured / doesn't work well with current debuggers.
>
> Unfortunately, I'm not very well aware of debugging tools in this
> context. Could you please name the debuggers I should take into
> account?

Hmm, I was mistaken in that regard. I was under the impression that at
least one of pageinspect, pg_filedump and pg_hexedit did support
column value introspection, which they apparently do not. pg_filedump
(and thus pg_hexedit) have some introspection, but none specialized
for jsonb (yet).

The point I tried to make was that introspection of ZSON would be even
more difficult due to it adding a non-standard compression method
which makes introspection effectively impossible (the algorithm can
replace things other than the strings it should replace, so it will be
difficult to retrieve structure from the encoded string).

With regards,

Matthias van de Meent

[0] 
https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/CHAP_PostgreSQL.html#PostgreSQL.Concepts.General.FeatureSupport.Extensions.13x
[1] 
https://blog.packagecloud.io/eng/2017/03/08/system-calls-are-much-slower-on-ec2/




Re: Replacing pg_depend PIN entries with a fixed range check

2021-05-26 Thread Tom Lane
Robert Haas  writes:
> So I think your proposal of allowing genbki-assigned OIDs to be reused
> in different catalogs is probably a pretty good one, but I wonder if
> we could just rejigger things so that collations just get normal OIDs
> > 16384.

Hm.  I can't readily think of a non-hack way of making that happen.
It's also unclear to me how it'd interact with assignment of OIDs
to regular user objects.  Maybe we'll have to go there eventually,
but I'm not in a hurry to.

Meanwhile, I'll draft a patch for the other thing.

regards, tom lane




Re: CALL versus procedures with output-only arguments

2021-05-26 Thread Tom Lane
Peter Eisentraut  writes:
> AFAICT, your patch does not maintain the property that
>  CREATE PROCEDURE p1(OUT int, OUT int)
> corresponds to
>  DROP PROCEDURE p1(int, int)
> which would be bad.

Why?  If it actually works that way right now, I'd maintain
strenously that it's broken.  The latter should be referring
to a procedure with two IN arguments.  Even if the SQL spec
allows fuzziness about that, we cannot afford to, because we
have a more generous view of overloading than the spec does.
(As far as I could tell from looking at the spec yesterday,
they think that you aren't allowed to have two procedures
with the same name/schema and same number of arguments,
regardless of the details of those arguments.  Up with that
I will not put.)

> I'm not opposed to reverting the feature if we can't find a good 
> solution in a hurry.

I'm not looking to revert the feature.  I mainly want a saner catalog
representation, and less inconsistency in object naming (which is
tightly tied to the first thing).

regards, tom lane




Re: Skip partition tuple routing with constant partition key

2021-05-26 Thread Zhihong Yu
Hi, Amit:

For ConvertTupleToPartition() in
0001-ExecFindPartition-cache-last-used-partition-v3.patch:

+   if (tempslot != NULL)
+   ExecClearTuple(tempslot);

If tempslot and parent_slot point to the same slot, should ExecClearTuple()
still be called ?

Cheers


Re: storing an explicit nonce

2021-05-26 Thread Robert Haas
On Tue, May 25, 2021 at 7:58 PM Stephen Frost  wrote:
> The simple thought I had was masking them out, yes.  No, you can't
> re-encrypt a different page with the same nonce.  (Re-encrypting the
> exact same page with the same nonce, however, just yields the same
> cryptotext and therefore is fine).

In the interest of not being viewed as too much of a naysayer, let me
first reiterate that I am generally in favor of TDE going forward and
am not looking to throw up unnecessary obstacles in the way of making
that happen.

That said, I don't see how this particular idea can work. When we want
to write a page out to disk, we need to identify which bits in the
page are hint bits, so that we can avoid including them in what is
encrypted, which seems complicated and expensive. But even worse, when
we then read a page back off of disk, we'd need to decrypt everything
except for the hint bits, but how do we know which bits are hint bits
if the page isn't decrypted yet? We can't annotate an 8kB page that
might be full with enough extra information to say where the
non-encrypted parts are and still have the result be guaranteed to fit
within 8kb.

Also, it's not just hint bits per se, but anything that would cause us
to use MarkBufferDirtyHint(). For a btree index, per  _bt_check_unique
and _bt_killitems, that includes the entire line pointer array,
because of how ItemIdMarkDead() is used. Even apart from the problem
of how decryption would know which things we encrypted and which
things we didn't, I really have a hard time believing that it's OK to
exclude the entire line pointer array in every btree page from
encryption from a security perspective. Among other potential
problems, that's leaking all the information an attacker could
possibly want to have about where their known plaintext might occur in
the page.

However, I believe that if we store the nonce in the page explicitly,
as proposed here, rather trying to derive it from the LSN, then we
don't need to worry about this kind of masking, which I think is
better from both a security perspective and a performance perspective.
There is one thing I'm not quite sure about, though. I had previously
imagined that each page would have a nonce and we could just do
nonce++ each time we write the page. But that doesn't quite work if
the standby can do more writes of the same page than the master. One
vague idea I have for fixing this is: let each page's 16-byte nonce
consist of 8 random bytes and an 8-byte counter that will be
incremented on every write. But, the first time a standby writes each
page, force a "key rotation" where the 8-byte random value is replaced
with a new one, different one from what the master is using for that
page. Detecting this is a bit expensive, because it probably means we
need to store the TLI that last wrote each page on every page too, but
maybe it could be made to work; we're talking about a feature that is
expensive by nature. However, I'm a little worried about the
cryptographic properties of this approach. It would often mean that an
attacker who has full filesystem access can get multiple encrypted
images of the same data, each encrypted with a different nonce. I
don't know whether that's a hazard or not, but it feels like the sort
of thing that, if I were a cryptographer, I would be pleased to have.

Another idea might be - instead of doing nonce++ every time we write
the page, do nonce=random(). That's eventually going to repeat a
value, but it's extremely likely to take a *super* long time if there
are enough bits. A potentially rather large problem, though, is that
generating random numbers in large quantities isn't very cheap.

Anybody got a better idea?

I really like your (Stephen's) idea of including something in the
special space that permits integrity checking. One thing that is quite
nice about that is we could do it first, as an independent patch,
before we did TDE. It would be an independently useful feature, and it
would mean that if there are any problems with the code that injects
stuff into the special space, we could try to track those down in a
non-TDE context. That's really good, because in a TDE context, the
pages are going to be garbled and unreadable (we hope, anyway). If we
have a problem that we can reproduce with just an integrity-checking
token shoved into every page, you can look at the page and try to
understand what went wrong. So I really like this direction both from
the point of view of improving integrity checking, and also from the
point of view of being able to debug problems.

Now, one downside of this approach is that if we have the ability to
turn integrity-checking tokens on and off, and separately we can turn
encryption on and off, then we can't simplify down to two cases as
Andres was advocating above; you have to cater to a variety of
possible values of how-much-stuff-we-squeezed-into-the-special space.
At that point you kind of end up with the approach the draft patches
were already taking, 

Re: storing an explicit nonce

2021-05-26 Thread Stephen Frost
Greetings,

* Robert Haas (robertmh...@gmail.com) wrote:
> On Tue, May 25, 2021 at 7:58 PM Stephen Frost  wrote:
> > The simple thought I had was masking them out, yes.  No, you can't
> > re-encrypt a different page with the same nonce.  (Re-encrypting the
> > exact same page with the same nonce, however, just yields the same
> > cryptotext and therefore is fine).
> 
> In the interest of not being viewed as too much of a naysayer, let me
> first reiterate that I am generally in favor of TDE going forward and
> am not looking to throw up unnecessary obstacles in the way of making
> that happen.

Quite glad to hear that.  Hopefully we'll all be able to get on the same
page to move TDE forward.

> That said, I don't see how this particular idea can work. When we want
> to write a page out to disk, we need to identify which bits in the
> page are hint bits, so that we can avoid including them in what is
> encrypted, which seems complicated and expensive. But even worse, when
> we then read a page back off of disk, we'd need to decrypt everything
> except for the hint bits, but how do we know which bits are hint bits
> if the page isn't decrypted yet? We can't annotate an 8kB page that
> might be full with enough extra information to say where the
> non-encrypted parts are and still have the result be guaranteed to fit
> within 8kb.

Yeah, Andres pointed that out and it's certainly an issue with this
general idea.

> Also, it's not just hint bits per se, but anything that would cause us
> to use MarkBufferDirtyHint(). For a btree index, per  _bt_check_unique
> and _bt_killitems, that includes the entire line pointer array,
> because of how ItemIdMarkDead() is used. Even apart from the problem
> of how decryption would know which things we encrypted and which
> things we didn't, I really have a hard time believing that it's OK to
> exclude the entire line pointer array in every btree page from
> encryption from a security perspective. Among other potential
> problems, that's leaking all the information an attacker could
> possibly want to have about where their known plaintext might occur in
> the page.

Also a good point.

> However, I believe that if we store the nonce in the page explicitly,
> as proposed here, rather trying to derive it from the LSN, then we
> don't need to worry about this kind of masking, which I think is
> better from both a security perspective and a performance perspective.
> There is one thing I'm not quite sure about, though. I had previously
> imagined that each page would have a nonce and we could just do
> nonce++ each time we write the page. But that doesn't quite work if
> the standby can do more writes of the same page than the master. One
> vague idea I have for fixing this is: let each page's 16-byte nonce
> consist of 8 random bytes and an 8-byte counter that will be
> incremented on every write. But, the first time a standby writes each
> page, force a "key rotation" where the 8-byte random value is replaced
> with a new one, different one from what the master is using for that
> page. Detecting this is a bit expensive, because it probably means we
> need to store the TLI that last wrote each page on every page too, but
> maybe it could be made to work; we're talking about a feature that is
> expensive by nature. However, I'm a little worried about the
> cryptographic properties of this approach. It would often mean that an
> attacker who has full filesystem access can get multiple encrypted
> images of the same data, each encrypted with a different nonce. I
> don't know whether that's a hazard or not, but it feels like the sort
> of thing that, if I were a cryptographer, I would be pleased to have.

I do agree that, in general, this is a feature that's expensive to begin
with and folks are generally going to be accepting of that.  Encrypting
the same data with different nonces will produce different results and
shouldn't be an issue.  The nonces really do need to be unique for a
given key though.

> Another idea might be - instead of doing nonce++ every time we write
> the page, do nonce=random(). That's eventually going to repeat a
> value, but it's extremely likely to take a *super* long time if there
> are enough bits. A potentially rather large problem, though, is that
> generating random numbers in large quantities isn't very cheap.

There's specific discussion about how to choose a nonce in NIST
publications and using a properly random one that's large enough is
one accepted approach, though my recollection was that the preference
was to use an incrementing guaranteed-unique nonce and using a random
one was more of a "if you can't coordinate using an incrementing one
then you can do this".  I can try to hunt for the specifics on that
though.

The issue of getting large amounts of cryptographically random numbers
seems very likely to make this not work so well though.

> Anybody got a better idea?

If we stipulate (and document) that all replicas need their own keys
then we no longer need to worry about nonce re-use between the primary
and the replica.  Not sure that's *better*, per se, but I do think it's
worth consideration.  Teaching pg_basebackup how to decrypt and then
re-encrypt with a different key wouldn't be challenging.

Re: Move pg_attribute.attcompression to earlier in struct for reduced size?

2021-05-26 Thread Tom Lane
Robert Haas  writes:
> On Wed, May 26, 2021 at 11:13 AM Tom Lane  wrote:
>> * As things stand here, once you've applied ALTER ... SET COMPRESSION
>> to select a specific method, there is no way to undo that and go
>> back to the use-the-default setting.  All you can do is change to
>> explicitly select the other method.  Should we invent "ALTER ...
>> SET COMPRESSION default" or the like to cover that?

> Yes. Irreversible catalog changes are bad.

Here's an add-on 0004 that does that, and takes care of assorted
silliness in the grammar and docs --- did you know that this patch
caused
alter table foo alter column bar set ;
to be allowed?

I think this is about ready to commit now (though I didn't yet nuke
GetDefaultToastCompression).

regards, tom lane

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 95a302ffee..9f7f42c4aa 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -8242,11 +8242,11 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
 compression method for values of compressible columns.
 (This can be overridden for individual columns by setting
 the COMPRESSION column option in
-CREATE TABLE or 
+CREATE TABLE or
 ALTER TABLE.)
-The supported compression methods are pglz and,
-if PostgreSQL was compiled with
---with-lz4, lz4.
+The supported compression methods are pglz and
+(if PostgreSQL was compiled with
+--with-lz4) lz4.
 The default is pglz.

   
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 3a21129021..08b07f561e 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -26253,10 +26253,10 @@ postgres=# SELECT * FROM pg_walfile_name_offset(pg_stop_backup());
  pg_column_compression
 
 pg_column_compression ( "any" )
-integer
+text


-Shows the compression algorithm that was used to compress a
+Shows the compression algorithm that was used to compress
 an individual variable-length value. Returns NULL
 if the value is not compressed.

diff --git a/doc/src/sgml/ref/alter_table.sgml b/doc/src/sgml/ref/alter_table.sgml
index 1431d2649b..939d3fe273 100644
--- a/doc/src/sgml/ref/alter_table.sgml
+++ b/doc/src/sgml/ref/alter_table.sgml
@@ -104,7 +104,6 @@ WITH ( MODULUS numeric_literal, REM
   GENERATED { ALWAYS | BY DEFAULT } AS IDENTITY [ ( sequence_options ) ] |
   UNIQUE index_parameters |
   PRIMARY KEY index_parameters |
-  COMPRESSION compression_method |
   REFERENCES reftable [ ( refcolumn ) ] [ MATCH FULL | MATCH PARTIAL | MATCH SIMPLE ]
 [ ON DELETE referential_action ] [ ON UPDATE referential_action ] }
 [ DEFERRABLE | NOT DEFERRABLE ] [ INITIALLY DEFERRED | INITIALLY IMMEDIATE ]
@@ -391,24 +390,27 @@ WITH ( MODULUS numeric_literal, REM
 
 
  
-  This sets the compression method to be used for data inserted into a column.
-
+  This form sets the compression method for a column, determining how
+  values inserted in future will be compressed (if the storage mode
+  permits compression at all).
   This does not cause the table to be rewritten, so existing data may still
   be compressed with other compression methods.  If the table is rewritten with
   VACUUM FULL or CLUSTER, or restored
-  with pg_restore, then all tuples are rewritten
-  with the configured compression methods.
-
-  Also, note that when data is inserted from another relation (for example,
-  by INSERT ... SELECT), tuples from the source data are
-  not necessarily detoasted, and any previously compressed data is retained
-  with its existing compression method, rather than recompressing with the
-  compression methods of the target columns.
-
+  with pg_restore, then all values are rewritten
+  with the configured compression method.
+  However, when data is inserted from another relation (for example,
+  by INSERT ... SELECT), values from the source table are
+  not necessarily detoasted, so any previously compressed data may retain
+  its existing compression method, rather than being recompressed with the
+  compression method of the target column.
   The supported compression
   methods are pglz and lz4.
-  lz4 is available only if --with-lz4
-  was used when building PostgreSQL.
+  (lz4 is available only if --with-lz4
+  was used when building PostgreSQL.)  In
+  addition, compression_method
+  can be default, which selects the default behavior of
+  consulting the  setting
+  at the time of data insertion to determine the method to use.
  
 

diff --git a/doc/src/sgml/ref/create_table.sgml b/doc/src/sgml/ref/create_table.sgml
index a8c5e4028a..c6d0a35e50 100644
--- a/doc/src/sgml/ref/create_table.sgml
+++ b/doc/src/sgml/ref/create_table.sgml

Re: Move pg_attribute.attcompression to earlier in struct for reduced size?

2021-05-26 Thread Justin Pryzby
On Wed, May 26, 2021 at 11:13:46AM -0400, Tom Lane wrote:
> Michael Paquier  writes:
> > Ah, the parallel with reltablespace and default_tablespace at database
> > level is a very good point.  It is true that currently the code would
> > assign attcompression to a non-zero value once the relation is defined
> > depending on default_toast_compression set for the database, but
> > setting it to 0 in this case would be really helpful to change the
> > compression methods of all the relations if doing something as crazy
> > as a VACUUM FULL for this database.  Count me as convinced.
> 
> Here's a draft patch series to address this.
> 
> 0001 removes the relkind checks I was questioning originally.
> As expected, this results in zero changes in check-world results.
> 
> 0002 is the main change in the semantics of attcompression.
> This does change the results of compression.sql, but in what
> seem to me to be expected ways: a column's compression option
> is now shown in \d+ output only if you explicitly set it.
> 
> 0003 further removes pg_dump's special handling of
> default_toast_compression.  I don't think we need that anymore.
> AFAICS its only effect would be to override the receiving server's
> default_toast_compression setting for dumped/re-loaded data, which
> does not seem like a behavior that anyone would want.
> 
> Loose ends:
> 
> * I've not reviewed the docs fully; there are likely some more
> things that need updated.
> 
> * As things stand here, once you've applied ALTER ... SET COMPRESSION
> to select a specific method, there is no way to undo that and go
> back to the use-the-default setting.  All you can do is change to
> explicitly select the other method.  Should we invent "ALTER ...
> SET COMPRESSION default" or the like to cover that?  (Since
> DEFAULT is a reserved word, that exact syntax might be a bit of
> a pain to implement, but maybe we could think of another word.)
> 
> * I find GetDefaultToastCompression() annoying.  I do not think
> it is project style to invent trivial wrapper functions around
> GUC variable references: it buys nothing while requiring readers
> to remember one more name than they would otherwise.  Since there
> are only two uses remaining, maybe this isn't very important either
> way, but I'm still inclined to flush it.

+1

It existed when default_toast_compression was a text string.  Since e5595de03,
it's an enum/int/char, and serves no purpose.

-- 
Justin




Re: storing an explicit nonce

2021-05-26 Thread Bruce Momjian
On Wed, May 26, 2021 at 07:14:47AM +0200, Antonin Houska wrote:
> Bruce Momjian  wrote:
> 
> > On Tue, May 25, 2021 at 04:48:21PM -0700, Andres Freund wrote:
> > > Hi,
> > > 
> > > On 2021-05-25 17:29:03 -0400, Bruce Momjian wrote:
> > > > So, let me ask --- I thought CTR basically took an encrypted stream of
> > > > bits and XOR'ed them with the data.  If that is true, then why are
> > > > changing hint bits a problem?  We already can see some of the bit stream
> > > > by knowing some bytes of the page.
> > > 
> > > A *single* reuse of the nonce in CTR reveals nearly all of the
> > > plaintext. As you say, the data is XORed with the key stream. Reusing
> > > the nonce means that you reuse the key stream. Which in turn allows you
> > > to do:
> > >   (data ^ stream) ^ (data' ^ stream)
> > > which can be simplified to
> > >   (data ^ data')
> > > thereby leaking all of data except the difference between data and
> > > data'. That's why it's so crucial to ensure that stream *always* differs
> > > between two rounds of encrypting "related" data.
> > > 
> > > We can't just "hope" that data doesn't change and use CTR.
> > 
> > My point was about whether we need to change the nonce, and hence
> > WAL-log full page images if we change hint bits.  If we don't and
> > reencrypt the page with the same nonce, don't we only expose the hint
> > bits?  I was not suggesting we avoid changing the nonce in non-hint-bit
> > cases.
> > 
> > I don't understand your computation above.  You decrypt the page into
> > shared buffers, you change a hint bit, and rewrite the page.  You are
> > re-XOR'ing the buffer copy with the same key and nonce.  Doesn't that
> > only change the hint bits in the new write?
> 
> The way I view things is that the CTR mode encrypts each individual bit,
> independent from any other bit on the page. For non-hint bits data=data', so
> (data ^ data') is always zero, regardless the actual values of the data. So I
> agree with you that by reusing the nonce we only expose the hint bits.

OK, that's what I thought.  We already expose the clog and fsm, so
exposing the hint bits seems acceptable.  If everyone agrees, I will
adjust my patch to not WAL log hint bit changes.
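
To see the XOR argument concretely, here is a tiny standalone sketch
(illustrative only, not code from any patch): with a reused keystream, an
observer learns data ^ data', which is zero everywhere the two page images
agree and nonzero only where the hint bits changed.

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	/* Two images of the "same page", differing only in byte 2 (a hint bit). */
	uint8_t		data[4] = {0xDE, 0xAD, 0x00, 0xEF};
	uint8_t		data2[4] = {0xDE, 0xAD, 0x01, 0xEF};
	uint8_t		stream[4] = {0x13, 0x37, 0xAB, 0xCD};	/* keystream reused for both */

	for (int i = 0; i < 4; i++)
	{
		uint8_t		c1 = data[i] ^ stream[i];
		uint8_t		c2 = data2[i] ^ stream[i];

		/* XOR of the two ciphertexts equals XOR of the two plaintexts. */
		printf("byte %d leaks %02X\n", i, c1 ^ c2);
	}
	return 0;
}

Bytes 0, 1 and 3 leak 00; only byte 2 leaks 01, i.e. the changed hint bit.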

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.





Re: storing an explicit nonce

2021-05-26 Thread Stephen Frost
Greetings,

* Stephen Frost (sfr...@snowman.net) wrote:
> * Robert Haas (robertmh...@gmail.com) wrote:
> > Another idea might be - instead of doing nonce++ every time we write
> > the page, do nonce=random(). That's eventually going to repeat a
> > value, but it's extremely likely to take a *super* long time if there
> > are enough bits. A potentially rather large problem, though, is that
> > generating random numbers in large quantities isn't very cheap.
> 
> There's specific discussion about how to choose a nonce in NIST
> publications and using a properly random one that's large enough is
> one accepted approach, though my recollection was that the preference
> was to use an incrementing guaranteed-unique nonce and using a random
> one was more of a "if you can't coordinate using an incrementing one
> then you can do this".  I can try to hunt for the specifics on that
> though.

Disucssion of generating IVs here:

https://nvlpubs.nist.gov/nistpubs/Legacy/SP/nistspecialpublication800-38d.pdf

section 8.2 specifically.

Note that 8.3 also discusses subsequent limitations which one should
follow when using a random nonce, to reduce the chances of a collision.

Thanks,

Stephen




Re: storing an explicit nonce

2021-05-26 Thread Robert Haas
On Wed, May 26, 2021 at 2:37 PM Stephen Frost  wrote:
> > Anybody got a better idea?
>
> If we stipulate (and document) that all replicas need their own keys
> then we no longer need to worry about nonce re-use between the primary
> and the replica.  Not sure that's *better*, per se, but I do think it's
> worth consideration.  Teaching pg_basebackup how to decrypt and then
> re-encrypt with a different key wouldn't be challenging.

I agree that we could do that and that it's possibly worth
considering. However, it would be easy - and tempting - for users to
violate the no-nonce-reuse principle. For example, consider a
hypothetical user who takes a backup on Monday via a filesystem
snapshot - which might be either (a) a snapshot of the cluster while
it is stopped, or (b) a snapshot of the cluster while it's running,
from which crash recovery can be safely performed as long as it's a
true atomic snapshot, or (c) a snapshot taken between pg_start_backup
and pg_stop_backup which will be used just like a backup taken by
pg_basebackup. In any of these cases, there's no opportunity for a
tool we provide to intervene and re-key. Now, we would provide a tool
that re-keys in such situations and tell people to be sure they run it
before using any of those backups, and maybe that's the best we can
do. However, that tool is going to run for a good long time because it
has to rewrite the entire cluster, so someone with a terabyte-scale
database is going to be sorely tempted to skip this "unnecessary" and
time-consuming step. If it were possible to set things up so that good
things happen automatically and without user action, that would be
swell.

Here's another idea: suppose that a nonce is 128 bits, 64 of which are
randomly generated at server startup, and the other 64 of which are a
counter. If you're willing to assume that the 64 bits generated
randomly at server startup are not going to collide in practice,
because the number of server lifetimes per key should be very small
compared to 2^64, then this gets you the benefits of a
randomly-generated nonce without needing to keep on generating new
cryptographically strong random numbers, and pretty much regardless of
what users do with their backups. If you replay an FPI, you can write
out the page exactly as you got it from the master, without
re-encrypting. If you modify and then write a page, you generate a
nonce for it containing your own server lifetime identifier.
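
A minimal sketch of that nonce layout (the struct and function names here are
made up for illustration; this is not a worked-out design):

#include <stdint.h>

/*
 * Hypothetical 16-byte page nonce: the high half identifies the server
 * lifetime that last wrote the page (64 random bits drawn once at startup),
 * the low half is a plain per-write counter.
 */
typedef struct PageNonce
{
	uint64_t	server_lifetime_id;	/* random value chosen at server start */
	uint64_t	write_counter;		/* incremented on every page write */
} PageNonce;

/* Called whenever this server (re)writes a page it did not just replay verbatim. */
static inline void
nonce_advance(PageNonce *n, uint64_t my_lifetime_id)
{
	if (n->server_lifetime_id != my_lifetime_id)
	{
		/* First write by this server lifetime, e.g. on a promoted standby. */
		n->server_lifetime_id = my_lifetime_id;
		n->write_counter = 0;
	}
	else
		n->write_counter++;
}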

> Yes, if the amount of space available is variable then there's an added
> cost for that.  While I appreciate the concern about having that be
> expensive, for my 2c at least, I like to think that having this sudden
> space that's available for use may lead to other really interesting
> capabilities beyond the ones we're talking about here, so I'm not really
> thrilled with the idea of boiling it down to just two cases.

Although I'm glad you like some things about this idea, I think the
proposed system will collapse if we press it too hard. We're going to
need to be judicious.

> One thing to be absolutely clear about here though is that simply taking
> a hash() of the ciphertext and storing that with the data does *not*
> provide cryptographic data integrity validation for the page because it
> doesn't involve the actual key or IV at all and the hash is done after
> the ciphertext is generated- therefore an attacker can change the data
> and just change the hash to match and you'd never know.

Ah, right. So you'd actually want something more like
hash(dboid||tsoid||relfilenode||blockno||block_contents||secret).
Maybe not generated exactly that way: perhaps the secret is really the
IV for the hash function rather than part of the hashed data, or
whatever. However you do it exactly, it prevents someone from
verifying - or faking - a signature unless they have the secret.
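
As an illustration of such a keyed digest (a sketch only; the exact field
list, layout, and choice of HMAC-SHA256 are assumptions rather than a
settled design), using OpenSSL's HMAC():

#include <assert.h>
#include <openssl/evp.h>
#include <openssl/hmac.h>
#include <stdint.h>
#include <string.h>

/*
 * Compute an HMAC-SHA256 over the page identity and contents.  Without the
 * secret key an attacker can neither verify nor forge the digest.
 */
void
page_digest(const unsigned char *key, int keylen,
            uint32_t dboid, uint32_t tsoid, uint32_t relfilenode,
            uint32_t blockno,
            const unsigned char *page, size_t pagelen,
            unsigned char out[32])
{
    unsigned char   buf[16 + 8192];     /* identity header + page image */
    unsigned int    outlen = 32;

    assert(pagelen <= 8192);

    memcpy(buf, &dboid, 4);
    memcpy(buf + 4, &tsoid, 4);
    memcpy(buf + 8, &relfilenode, 4);
    memcpy(buf + 12, &blockno, 4);
    memcpy(buf + 16, page, pagelen);

    HMAC(EVP_sha256(), key, keylen, buf, 16 + pagelen, out, &outlen);
}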

> very hard for the attacker to discover) and suddently you're doing what
> AES-GCM *already* does for you, except you're trying to hack it yourself
> instead of using the tools available which were written by experts.

I am all in favor of using the expert-written tools provided we can
figure out how to do it in a way we all agree is correct.

> What this means for your proposal above is that the actual data
> validation information will be generated in two different ways depending
> on if we're using AES-GCM and doing TDE, or if we're doing just the data
> validation piece and not encrypting anything.  That's maybe not ideal
> but I don't think it's a huge issue either and your proposal will still
> address the question of if we end up missing anything when it comes to
> how the special area is handled throughout the code.

Hmm. Is there no expert-written method for this sort of thing without
encryption? One thing that I think would be really helpful is to be
able to take a TDE-ified cluster and run it through decryption, ending
up with a cluster that still has extra special space but which isn't
actually encrypted any more. Ideally

Re: storing an explicit nonce

2021-05-26 Thread Stephen Frost
Greetings,

* Bruce Momjian (br...@momjian.us) wrote:
> OK, that's what I thought.  We already expose the clog and fsm, so
> exposing the hint bits seems acceptable.  If everyone agrees, I will
> adjust my patch to not WAL log hint bit changes.

Robert pointed out that it's not just hint bits where this is happening
though, but it can also happen with btree line pointer arrays.  Even if
we were entirely comfortable accepting that the hint bits are leaked
because of this, leaking the btree line pointer array doesn't seem like
it could possibly be acceptable..

I've not run down that code myself, but I don't have any reason to doubt
Robert's assessment.

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: Move pg_attribute.attcompression to earlier in struct for reduced size?

2021-05-26 Thread Tom Lane
I wrote:
> I think this is about ready to commit now (though I didn't yet nuke
> GetDefaultToastCompression).

Here's a bundled-up final version, in case anybody would prefer
to review it that way.

regards, tom lane

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index b26b872a06..16493209c6 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -1261,10 +1261,14 @@
attcompression char
   
   
-   The current compression method of the column.  If it is an invalid
-   compression method ('\0') then column data will not
-   be compressed.  Otherwise, 'p' = pglz compression or
-   'l' = LZ4 compression.
+   The current compression method of the column.  Typically this is
+   '\0' to specify use of the current default setting
+   (see ).  Otherwise,
+   'p' selects pglz compression, while
+   'l' selects LZ4
+   compression.  However, this field is ignored
+   whenever attstorage does not allow
+   compression.
   
  
 
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 7e32b0686c..9f7f42c4aa 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -8239,13 +8239,14 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;

 This variable sets the default
 TOAST
-compression method for columns of newly-created tables. The
-CREATE TABLE statement can override this default
-by specifying the COMPRESSION column option.
-
-The supported compression methods are pglz and,
-if PostgreSQL was compiled with
---with-lz4, lz4.
+compression method for values of compressible columns.
+(This can be overridden for individual columns by setting
+the COMPRESSION column option in
+CREATE TABLE or
+ALTER TABLE.)
+The supported compression methods are pglz and
+(if PostgreSQL was compiled with
+--with-lz4) lz4.
 The default is pglz.

   
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 3a21129021..08b07f561e 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -26253,10 +26253,10 @@ postgres=# SELECT * FROM pg_walfile_name_offset(pg_stop_backup());
  pg_column_compression
 
 pg_column_compression ( "any" )
-integer
+text


-Shows the compression algorithm that was used to compress a
+Shows the compression algorithm that was used to compress
 an individual variable-length value. Returns NULL
 if the value is not compressed.

diff --git a/doc/src/sgml/ref/alter_table.sgml b/doc/src/sgml/ref/alter_table.sgml
index 1431d2649b..939d3fe273 100644
--- a/doc/src/sgml/ref/alter_table.sgml
+++ b/doc/src/sgml/ref/alter_table.sgml
@@ -104,7 +104,6 @@ WITH ( MODULUS numeric_literal, REM
   GENERATED { ALWAYS | BY DEFAULT } AS IDENTITY [ ( sequence_options ) ] |
   UNIQUE index_parameters |
   PRIMARY KEY index_parameters |
-  COMPRESSION compression_method |
   REFERENCES reftable [ ( refcolumn ) ] [ MATCH FULL | MATCH PARTIAL | MATCH SIMPLE ]
 [ ON DELETE referential_action ] [ ON UPDATE referential_action ] }
 [ DEFERRABLE | NOT DEFERRABLE ] [ INITIALLY DEFERRED | INITIALLY IMMEDIATE ]
@@ -391,24 +390,27 @@ WITH ( MODULUS numeric_literal, REM
 
 
  
-  This sets the compression method to be used for data inserted into a column.
-
+  This form sets the compression method for a column, determining how
+  values inserted in future will be compressed (if the storage mode
+  permits compression at all).
   This does not cause the table to be rewritten, so existing data may still
   be compressed with other compression methods.  If the table is rewritten with
   VACUUM FULL or CLUSTER, or restored
-  with pg_restore, then all tuples are rewritten
-  with the configured compression methods.
-
-  Also, note that when data is inserted from another relation (for example,
-  by INSERT ... SELECT), tuples from the source data are
-  not necessarily detoasted, and any previously compressed data is retained
-  with its existing compression method, rather than recompressing with the
-  compression methods of the target columns.
-
+  with pg_restore, then all values are rewritten
+  with the configured compression method.
+  However, when data is inserted from another relation (for example,
+  by INSERT ... SELECT), values from the source table are
+  not necessarily detoasted, so any previously compressed data may retain
+  its existing compression method, rather than being recompressed with the
+  compression method of the target column.
   The supported compression
   methods are pglz and lz4.
-  lz4 is available only if --with-lz4
-  was used when building PostgreSQL.
+  (lz4 i

Re: Commitfest app vs. pgsql-docs

2021-05-26 Thread Magnus Hagander
On Mon, May 24, 2021 at 7:21 PM Julien Rouhaud  wrote:
>
> On Mon, May 24, 2021 at 11:03 PM Andrew Dunstan  wrote:
> >
> > On 5/24/21 10:55 AM, Magnus Hagander wrote:
> > > On Mon, May 24, 2021 at 4:18 PM Andrew Dunstan  
> > > wrote:
> > >>
> > >> On 5/24/21 8:42 AM, Daniel Gustafsson wrote:
> >  On 24 May 2021, at 11:47, Magnus Hagander  wrote:
> > 
> >  On Wed, May 19, 2021 at 11:08 PM Alvaro Herrera 
> >   wrote:
> > > On 2021-May-19, Andrew Dunstan wrote:
> > >
> > >> It's just a reference after all. So someone supplies a reference to 
> > >> an
> > >> email on an out of the way list. What's the evil that will occur? Not
> > >> much really  AFAICT.
> >  Well, if you include all lists, the ability for you to findi things by
> >  the "most recent posts" or by searching for anything other than a
> >  unique message id will likely become less useful.
> > >>> Thats a good case for restricting this to the smaller set of lists 
> > >>> which will
> > >>> cover most submissions anyways.  With a smaller set we could make the 
> > >>> UX still
> > >>> work without presenting an incredibly long list.
> > >>>
> > >>> Or, the most recent emails dropdown only cover a set of common lists but
> > >>> a search will scan all lists?
> > >>>
> >  As long as you only ever search by message-id it won't make a 
> >  difference.
> > >>> Without any supporting evidence to back it up, my gut feeling tells me 
> > >>> the most
> > >>> recent mails list is a good/simple way to lower the bar for submissions.
> > >>>
> > >> Maybe. I only ever do this by using an exact message-id, since that's
> > >> what the web form specifically asks for :-)
> > > The webform lets you either do a free text search, or pick from a
> > > list, or enter a message-id, no?
> >
> >
> >
> > Yes it does, but the text next to the field says "Specify thread msgid:".
>
> Yes, I've always been confused by that form.  I may have tried to
> enter some free text once but AFAIR I always use the specific
> message-id.

This is clearly in need of a better UX. Any suggestions on how to improve
it would be welcome. Would it be enough to just say "Or specify... "?

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/




Re: storing an explicit nonce

2021-05-26 Thread Bruce Momjian
On Wed, May 26, 2021 at 01:56:38PM -0400, Robert Haas wrote:
> However, I believe that if we store the nonce in the page explicitly,
> as proposed here, rather trying to derive it from the LSN, then we
> don't need to worry about this kind of masking, which I think is
> better from both a security perspective and a performance perspective.

You are saying that by using a non-LSN nonce, you can write out the page
with a new nonce, but the same LSN, and also discard the page during
crash recovery and use the WAL copy?

I am confused why checksums, which are widely used, acceptably require
wal_log_hints, but there is concern that file encryption, which is
heavier, cannot acceptably require wal_log_hints.  I must be missing
something.

Why can't checksums also throw away hint bit changes like you want to do
for file encryption and not require wal_log_hints?

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.





Re: storing an explicit nonce

2021-05-26 Thread Stephen Frost
Greetings,

* Robert Haas (robertmh...@gmail.com) wrote:
> On Wed, May 26, 2021 at 2:37 PM Stephen Frost  wrote:
> > > Anybody got a better idea?
> >
> > If we stipulate (and document) that all replicas need their own keys
> > then we no longer need to worry about nonce re-use between the primary
> > and the replica.  Not sure that's *better*, per se, but I do think it's
> > worth consideration.  Teaching pg_basebackup how to decrypt and then
> > re-encrypt with a different key wouldn't be challenging.
> 
> I agree that we could do that and that it's possible worth
> considering. However, it would be easy - and tempting - for users to
> violate the no-nonce-reuse principle. For example, consider a

(guessing you meant no-key-reuse above)

> hypothetical user who takes a backup on Monday via a filesystem
> snapshot - which might be either (a) a snapshot of the cluster while
> it is stopped, or (b) a snapshot of the cluster while it's running,
> from which crash recovery can be safely performed as long as it's a
> true atomic snapshot, or (c) a snapshot taken between pg_start_backup
> and pg_stop_backup which will be used just like a backup taken by
> pg_basebackup. In any of these cases, there's no opportunity for a
> tool we provide to intervene and re-key. Now, we would provide a tool
> that re-keys in such situations and tell people to be sure they run it
> before using any of those backups, and maybe that's the best we can
> do. However, that tool is going to run for a good long time because it
> has to rewrite the entire cluster, so someone with a terabyte-scale
> database is going to be sorely tempted to skip this "unnecessary" and
> time-consuming step. If it were possible to set things up so that good
> things happen automatically and without user action, that would be
> swell.

Yes, if someone were to use a snapshot and set up a replica from it
they'd end up with the same key being used and potentially have an issue
with the key+nonce combination being re-used between the primary and
replica with different data leading to a possible data leak.

> Here's another idea: suppose that a nonce is 128 bits, 64 of which are
> randomly generated at server startup, and the other 64 of which are a
> counter. If you're willing to assume that the 64 bits generated
> randomly at server startup are not going to collide in practice,
> because the number of server lifetimes per key should be very small
> compared to 2^64, then this gets you the benefits of a
> randomly-generate nonce without needing to keep on generating new
> cryptographically strong random numbers, and pretty much regardless of
> what users do with their backups. If you replay an FPI, you can write
> out the page exactly as you got it from the master, without
> re-encrypting. If you modify and then write a page, you generate a
> nonce for it containing your own server lifetime identifier.

Yes, this kind of approach is discussed in the NIST publication in
section 8.2.2.  We'd have to keep track of what nonce we used for which
page, of course, but that should be alright using the special space as
discussed.

> > Yes, if the amount of space available is variable then there's an added
> > cost for that.  While I appreciate the concern about having that be
> > expensive, for my 2c at least, I like to think that having this sudden
> > space that's available for use may lead to other really interesting
> > capabilities beyond the ones we're talking about here, so I'm not really
> > thrilled with the idea of boiling it down to just two cases.
> 
> Although I'm glad you like some things about this idea, I think the
> proposed system will collapse if we press it too hard. We're going to
> need to be judicious.

Sure.

> > One thing to be absolutely clear about here though is that simply taking
> > a hash() of the ciphertext and storing that with the data does *not*
> > provide cryptographic data integrity validation for the page because it
> > doesn't involve the actual key or IV at all and the hash is done after
> > the ciphertext is generated- therefore an attacker can change the data
> > and just change the hash to match and you'd never know.
> 
> Ah, right. So you'd actually want something more like
> hash(dboid||tsoid||relfilenode||blockno||block_contents||secret).
> Maybe not generated exactly that way: perhaps the secret is really the
> IV for the hash function rather than part of the hashed data, or
> whatever. However you do it exactly, it prevents someone from
> verifying - or faking - a signature unless they have the secret.
> 
> > very hard for the attacker to discover) and suddently you're doing what
> > AES-GCM *already* does for you, except you're trying to hack it yourself
> > instead of using the tools available which were written by experts.
> 
> I am all in favor of using the expert-written tools provided we can
> figure out how to do it in a way we all agree is correct.

In the patch set that Bruce has which uses the OpenSSL functions 

Re: storing an explicit nonce

2021-05-26 Thread Bruce Momjian
On Wed, May 26, 2021 at 03:49:43PM -0400, Stephen Frost wrote:
> Greetings,
> 
> * Bruce Momjian (br...@momjian.us) wrote:
> > OK, that's what I thought.  We already expose the clog and fsm, so
> > exposing the hint bits seems acceptable.  If everyone agrees, I will
> > adjust my patch to not WAL log hint bit changes.
> 
> Robert pointed out that it's not just hint bits where this is happening
> though, but it can also happen with btree line pointer arrays.  Even if
> we were entirely comfortable accepting that the hint bits are leaked
> because of this, leaking the btree line pointer array doesn't seem like
> it could possibly be acceptable..
> 
> I've not run down that code myself, but I don't have any reason to doubt
> Robert's assessment.

OK, I guess we could split out log_hints to maybe just FPW-log btree
changes or something, but my recent email questions why wal_log_hints is
an issue anyway.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.





Re: storing an explicit nonce

2021-05-26 Thread Bruce Momjian
On Wed, May 26, 2021 at 04:40:48PM -0400, Bruce Momjian wrote:
> On Wed, May 26, 2021 at 01:56:38PM -0400, Robert Haas wrote:
> > However, I believe that if we store the nonce in the page explicitly,
> > as proposed here, rather trying to derive it from the LSN, then we
> > don't need to worry about this kind of masking, which I think is
> > better from both a security perspective and a performance perspective.
> 
> You are saying that by using a non-LSN nonce, you can write out the page
> with a new nonce, but the same LSN, and also discard the page during
> crash recovery and use the WAL copy?
> 
> I am confused why checksums, which are widely used, acceptably require
> wal_log_hints, but there is concern that file encryption, which is
> heavier, cannot acceptably require wal_log_hints.  I must be missing
> something.
> 
> Why can't checksums also throw away hint bit changes like you want to do
> for file encryption and not require wal_log_hints?

One detail might be this extra hint bit FPW case:


https://github.com/postgres/postgres/compare/bmomjian:cfe-01-doc..bmomjian:_cfe-02-internaldoc.patch

However, if a hint-bit-modified page is written to the file system
during a checkpoint, and there is a later hint bit change switching
the same page from clean to dirty during the same checkpoint, we
need a new LSN, and wal_log_hints doesn't give us a new LSN here.
The fix for this is to update the page LSN by writing a dummy
WAL record via xloginsert.c::LSNForEncryption() in such cases.

Is this how file encryption is different from checksum wal_log_hints,
and the big concern?

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.





Re: Add PortalDrop in exec_execute_message

2021-05-26 Thread Alvaro Herrera
On 2021-May-25, Yura Sokolov wrote:

> Tom Lane писал 2021-05-21 21:23:
> > Yura Sokolov  writes:
> > > I propose to add PortalDrop at the 'if (completed)' branch of
> > > exec_execute_message.
> > 
> > This violates our wire protocol specification, which
> > specifically says
> > 
> > If successfully created, a named portal object lasts till the end of
> > the current transaction, unless explicitly destroyed. An unnamed
> > portal is destroyed at the end of the transaction, or as soon as the
> > next Bind statement specifying the unnamed portal as destination is
> > issued. (Note that a simple Query message also destroys the unnamed
> > portal.)
> > 
> > I'm inclined to think that your complaint would be better handled
> > by having the client send a portal-close command, if it's not
> > going to do something else immediately.
> 
> I thought about it as well. Then, if I understand correctly,
> PQsendQueryGuts and PQsendQueryInternal in pipeline mode should send
> "close portal" (CP) message after "execute" message, right?

I don't think they should do that.  The portal remains open, and the
libpq interface does that.  The portal gets closed at end of transaction
without the need for any client message.  I think if the client wanted
to close the portal ahead of time, it would need a new libpq entry point
to send the message to do that.

(I didn't add a Close Portal message to PQsendQueryInternal in pipeline
mode precisely because there is no such message in PQsendQueryGuts.
I think it would be wrong to unconditionally add a Close Portal message
to any of those places.)

-- 
Álvaro Herrera39°49'30"S 73°17'W




Re: storing an explicit nonce

2021-05-26 Thread Bruce Momjian
On Wed, May 26, 2021 at 01:56:38PM -0400, Robert Haas wrote:
> In the interest of not being viewed as too much of a naysayer, let me
> first reiterate that I am generally in favor of TDE going forward and
> am not looking to throw up unnecessary obstacles in the way of making
> that happen.

Rather than surprise anyone, I might as well just come out and say some
things.  First, I have always admitted this feature has limited
usefulness.  

I think a non-LSN nonce adds a lot of code complexity, which adds a code
and maintenance burden.  It also prevents the creation of an encrypted
replica from a non-encrypted primary using binary replication, which
makes deployment harder.

Take a feature of limited usefulness, add code complexity and deployment
difficulty, and the feature becomes even less useful.

For these reasons, if we decide to go in the direction of using a
non-LSN nonce, I no longer plan to continue working on this feature. I
would rather work on things that have a more positive impact.  Maybe a
non-LSN nonce is a better long-term plan, but there are too many
unknowns and complexity for me to feel comfortable with it.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.





Re: Add PortalDrop in exec_execute_message

2021-05-26 Thread Tom Lane
Alvaro Herrera  writes:
> (I didn't add a Close Portal message to PQsendQueryInternal in pipeline
> mode precisely because there is no such message in PQsendQueryGuts.
> I think it would be wrong to unconditionally add a Close Portal message
> to any of those places.)

Yeah, I'm not very comfortable with having libpq take it on itself
to do that, either.

Looking back at the original complaint, it seems like it'd be fair to
wonder why we're still holding a page pin in a supposedly completed
executor run.  Maybe the right fix is somewhere in the executor
scan logic.

regards, tom lane




Speed up pg_checksums in cases where checksum already set

2021-05-26 Thread Greg Sabino Mullane
The attached patch makes an optimization to pg_checksums which prevents
rewriting the block if the checksum is already what we expect. This can
lead to much faster runs in cases where it is already set (e.g. enabled ->
disabled -> enable, external helper process, interrupted runs, future
parallel processes). There is also an effort to not sync the data directory
if no changes were written. Finally, added a bit more output on how many
files were actually changed, e.g.:

Checksum operation completed
Files scanned:   1236
Blocks scanned:  23283
Files modified:  38
Blocks modified: 19194
pg_checksums: syncing data directory
pg_checksums: updating control file
Checksums enabled in cluster
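
For illustration, the per-block skip described above boils down to
something like the following sketch (pg_checksum_page() and the page
header fields are the real server APIs; everything else is illustrative
and not taken from the attached patch):

#include "postgres_fe.h"
#include "storage/bufpage.h"
#include "storage/checksum.h"

/*
 * Decide whether a block needs its checksum (re)written.  Returns true when
 * the caller must write the block back; returns false when the stored
 * checksum already matches, so the write (and any later fsync) can be
 * skipped.
 */
static bool
set_checksum_if_needed(char *page, BlockNumber absolute_blkno)
{
    PageHeader  phdr = (PageHeader) page;
    uint16      expected;

    if (PageIsNew(page))
        return false;           /* new pages carry no checksum */

    expected = pg_checksum_page(page, absolute_blkno);
    if (phdr->pd_checksum == expected)
        return false;           /* already correct: nothing to rewrite */

    phdr->pd_checksum = expected;
    return true;
}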

Cheers,
Greg


pg_checksums.optimize.writes.patch
Description: Binary data


Re: Add ZSON extension to /contrib/

2021-05-26 Thread Bruce Momjian
On Tue, May 25, 2021 at 01:55:13PM +0300, Aleksander Alekseev wrote:
> Hi hackers,
> 
> Back in 2016 while being at PostgresPro I developed the ZSON extension [1]. 
> The
> extension introduces the new ZSON type, which is 100% compatible with JSONB 
> but
> uses a shared dictionary of strings most frequently used in given JSONB
> documents for compression. These strings are replaced with integer IDs.
> Afterward, PGLZ (and now LZ4) applies if the document is large enough by 
> common
> PostgreSQL logic. Under certain conditions (many large documents), this saves
> disk space, memory and increases the overall performance. More details can be
> found in README on GitHub.

I think this is interesting because it is one of the few cases that
allow compression outside of a single column.  Here is a list of
compression options:

https://momjian.us/main/blogs/pgblog/2020.html#April_27_2020

1. single field
2. across rows in a single page
3. across rows in a single column
4. across all columns and rows in a table
5. across tables in a database
6. across databases

While standard Postgres does #1, ZSON allows 2-5, assuming the data is
in the ZSON data type.  I think this cross-field compression has great
potential for cases where the data is not relational, or hasn't had time
to be structured relationally.  It also opens questions of how to do
this cleanly in a relational system.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.





Re: Incorrect usage of strtol, atoi for non-numeric junk inputs

2021-05-26 Thread Alvaro Herrera
On 2021-May-19, Bharath Rupireddy wrote:

> While working on [1], I found that some parts of the code is using
> strtol and atoi without checking for non-numeric junk input strings. I
> found this strange. Most of the time users provide proper numeric
> strings but there can be some scenarios where these strings are not
> user-supplied but generated by some other code which may contain
> non-numeric strings. Shouldn't the code use strtol or atoi
> appropriately and error out in such cases?  One way to fix this once
> and for all is to have a common API something like int
> pg_strtol/pg_str_convert_to_int(char *opt_name, char *opt_value) which
> returns a generic message upon invalid strings ("invalid value \"%s\"
> is provided for option \"%s\", opt_name, opt_value) and returns
> integers on successful parsing.

Hi, how is this related to
https://postgr.es/m/20191028012000.ga59...@begriffs.com ?
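
For reference, the strict parsing the quoted proposal describes amounts to
strtol() with endptr and errno checks; a minimal standalone sketch (the
helper name and message wording are hypothetical):

#include <errno.h>
#include <limits.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* Parse a base-10 integer option strictly, rejecting trailing junk and overflow. */
static bool
parse_int_strict(const char *opt_name, const char *opt_value, int *result)
{
    char       *endptr;
    long        val;

    errno = 0;
    val = strtol(opt_value, &endptr, 10);

    if (errno == ERANGE || val < INT_MIN || val > INT_MAX ||
        endptr == opt_value || *endptr != '\0')
    {
        fprintf(stderr, "invalid value \"%s\" for option \"%s\"\n",
                opt_value, opt_name);
        return false;
    }

    *result = (int) val;
    return true;
}

int
main(int argc, char **argv)
{
    int         value;

    if (argc == 3 && parse_int_strict(argv[1], argv[2], &value))
        printf("%s = %d\n", argv[1], value);
    return 0;
}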

-- 
Álvaro Herrera   Valdivia, Chile




Reducing the range of OIDs consumed by genbki.pl

2021-05-26 Thread Tom Lane
The attached patch stems from the conversation at [1];
I'm starting a new thread to avoid confusing the cfbot.

Briefly, the idea is to allow reverting the change made in
commit ab596105b to increase FirstBootstrapObjectId from
12000 to 13000, by teaching genbki.pl to assign OIDs
independently in each catalog rather than from a single
OID counter.  Thus, the OIDs in this range will not be
globally unique anymore, but only unique per-catalog.

The aforesaid commit had to increase FirstBootstrapObjectId
because as of HEAD, genbki.pl needs to consume OIDs up through
12035, overrunning the old limit of 12000.  But moving up that
limit seems a bit risky, cf [2].  It'd be better if we could
avoid doing that.  Since the OIDs in question are spread across
several catalogs, allocating them per-catalog seems to fix the
problem quite effectively.  With the attached patch, the ending
OID counters are

GenbkiNextOid(pg_amop) = 10945
GenbkiNextOid(pg_amproc) = 10697
GenbkiNextOid(pg_cast) = 10230
GenbkiNextOid(pg_opclass) = 10164

so we have quite a lot of daylight before we'll ever approach
12000 again.

Per-catalog OID uniqueness shouldn't be a problem here, because
any code that's assuming global uniqueness is broken anyway;
no such guarantee exists after the OID counter has wrapped
around.

So I propose shoehorning this into v14, to avoid shipping a
release where FirstBootstrapObjectId has been bumped up.

regards, tom lane

[1] 
https://www.postgresql.org/message-id/flat/3737988.1618451008%40sss.pgh.pa.us

[2] 
https://www.postgresql.org/message-id/flat/CAGPqQf3JYTrTB1E1fu_zOGj%2BrG_kwTfa3UcUYPfNZL9o1bcYNw%40mail.gmail.com

diff --git a/doc/src/sgml/bki.sgml b/doc/src/sgml/bki.sgml
index b33e59d5e4..db1b3d5e9a 100644
--- a/doc/src/sgml/bki.sgml
+++ b/doc/src/sgml/bki.sgml
@@ -418,11 +418,11 @@

 If genbki.pl needs to assign an OID to a catalog
 entry that does not have a manually-assigned OID, it will use a value in
-the range 1—12999.  The server's OID counter is set to 13000
+the range 1—11999.  The server's OID counter is set to 12000
 at the start of a bootstrap run.  Thus objects created by regular SQL
 commands during the later phases of bootstrap, such as objects created
 while running the information_schema.sql script,
-receive OIDs of 13000 or above.
+receive OIDs of 12000 or above.

 

diff --git a/src/backend/catalog/genbki.pl b/src/backend/catalog/genbki.pl
index 112b3affdf..e5e774267e 100644
--- a/src/backend/catalog/genbki.pl
+++ b/src/backend/catalog/genbki.pl
@@ -167,15 +167,17 @@ foreach my $oid (keys %oidcounts)
 die "found $found duplicate OID(s) in catalog data\n" if $found;
 
 
-# Oids not specified in the input files are automatically assigned,
+# OIDs not specified in the input files are automatically assigned,
 # starting at FirstGenbkiObjectId, extending up to FirstBootstrapObjectId.
+# We allow such OIDs to be assigned independently within each catalog.
 my $FirstGenbkiObjectId =
   Catalog::FindDefinedSymbol('access/transam.h', $include_path,
 	'FirstGenbkiObjectId');
 my $FirstBootstrapObjectId =
   Catalog::FindDefinedSymbol('access/transam.h', $include_path,
 	'FirstBootstrapObjectId');
-my $GenbkiNextOid = $FirstGenbkiObjectId;
+# Hash of next available OID, indexed by catalog name.
+my %GenbkiNextOids;
 
 
 # Fetch some special data that we will substitute into the output file.
@@ -563,8 +565,7 @@ EOM
 			# Assign oid if oid column exists and no explicit assignment in row
 			if ($attname eq "oid" and not defined $bki_values{$attname})
 			{
-$bki_values{$attname} = $GenbkiNextOid;
-$GenbkiNextOid++;
+$bki_values{$attname} = assign_next_oid($catname);
 			}
 
 			# Replace OID synonyms with OIDs per the appropriate lookup rule.
@@ -669,11 +670,6 @@ foreach my $declaration (@index_decls)
 # last command in the BKI file: build the indexes declared above
 print $bki "build indices\n";
 
-# check that we didn't overrun available OIDs
-die
-  "genbki OID counter reached $GenbkiNextOid, overrunning FirstBootstrapObjectId\n"
-  if $GenbkiNextOid > $FirstBootstrapObjectId;
-
 # Now generate system_constraints.sql
 
 foreach my $c (@system_constraints)
@@ -1081,6 +1077,25 @@ sub form_pg_type_symbol
 	return $name . $arraystr . 'OID';
 }
 
+# Assign an unused OID within the specified catalog.
+sub assign_next_oid
+{
+	my $catname = shift;
+
+	# Initialize, if no previous request for this catalog.
+	$GenbkiNextOids{$catname} = $FirstGenbkiObjectId
+	  if !defined($GenbkiNextOids{$catname});
+
+	my $result = $GenbkiNextOids{$catname}++;
+
+	# Check that we didn't overrun available OIDs
+	die
+	  "genbki OID counter for $catname reached $result, overrunning FirstBootstrapObjectId\n"
+	  if $result >= $FirstBootstrapObjectId;
+
+	return $result;
+}
+
 sub usage
 {
 	die <

Re: Move pg_attribute.attcompression to earlier in struct for reduced size?

2021-05-26 Thread Alvaro Herrera
On 2021-May-26, Tom Lane wrote:

> I wrote:
> > I think this is about ready to commit now (though I didn't yet nuke
> > GetDefaultToastCompression).
> 
> Here's a bundled-up final version, in case anybody would prefer
> to review it that way.

Looks good to me.

I tested the behavior with partitioned tables and it seems OK.

It would be good to have a test case in src/bin/pg_dump/t/002_pg_dump.pl
for the case ... and I find it odd that we don't seem to have anything
for the "CREATE TABLE foo (LIKE sometab INCLUDING stuff)" form of the
command ... but neither of those seem the fault of this patch, and they
both work as [I think] is intended.

-- 
Álvaro Herrera   Valdivia, Chile
"Estoy de acuerdo contigo en que la verdad absoluta no existe...
El problema es que la mentira sí existe y tu estás mintiendo" (G. Lama)




Re: Move pg_attribute.attcompression to earlier in struct for reduced size?

2021-05-26 Thread Tom Lane
Alvaro Herrera  writes:
> Looks good to me.
> I tested the behavior with partitioned tables and it seems OK.

Thanks for reviewing/testing!

> It would be good to have a test case in src/bin/pg_dump/t/002_pg_dump.pl
> for the case

Personally I won't touch 002_pg_dump.pl with a 10-foot pole, but if
somebody else wants to, have at it.

> ... and I find it odd that we don't seem to have anything
> for the "CREATE TABLE foo (LIKE sometab INCLUDING stuff)" form of the
> command ... but neither of those seem the fault of this patch, and they
> both work as [I think] is intended.

Hm, there's this in compression.sql:

-- test LIKE INCLUDING COMPRESSION
CREATE TABLE cmdata2 (LIKE cmdata1 INCLUDING COMPRESSION);
\d+ cmdata2

Or did you mean the case with a partitioned table specifically?

regards, tom lane




Re: storing an explicit nonce

2021-05-26 Thread Andres Freund
Hi,

On 2021-05-26 07:14:47 +0200, Antonin Houska wrote:
> Bruce Momjian  wrote:
> > On Tue, May 25, 2021 at 04:48:21PM -0700, Andres Freund wrote:
> > My point was about whether we need to change the nonce, and hence
> > WAL-log full page images if we change hint bits.  If we don't and
> > reencrypt the page with the same nonce, don't we only expose the hint
> > bits?  I was not suggesting we avoid changing the nonce in non-hint-bit
> > cases.
> >
> > I don't understand your computation above.  You decrypt the page into
> > shared buffers, you change a hint bit, and rewrite the page.  You are
> > re-XOR'ing the buffer copy with the same key and nonce.  Doesn't that
> > only change the hint bits in the new write?

Yea, I had a bit of a misfire there. Sorry.

I suspect that if we try to not disclose data if an attacker has write
access, this still leaves us with issues around nonce reuse, unless we
also employ integrity measures. Particularly due to CTR mode, which
makes it easy to manipulate individual parts of the encrypted page
without causing the decrypted page to be invalid. E.g. the attacker can
just update pd_upper on the page by a small offset, and suddenly the
replay will insert the tuple at a slightly shifted offset - which then
seems to leak enough data to actually analyze things?
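
To make that concrete: with a counter-mode stream cipher the ciphertext is
plaintext XOR keystream, so XOR-ing a chosen delta into a ciphertext byte
XORs the same delta into whatever decryption later produces, and no key is
needed.  A toy demonstration, using a stand-in keystream rather than real
AES-CTR output since only the XOR structure matters:

#include <stdint.h>
#include <stdio.h>

/*
 * CTR-mode malleability in miniature: ciphertext = plaintext XOR keystream,
 * so XOR-ing a delta into the ciphertext XORs the same delta into the
 * decrypted bytes, no key required.
 */
int
main(void)
{
    uint8_t keystream[8] = {0x3a, 0x91, 0x5c, 0x07, 0xee, 0x42, 0xb8, 0x19};
    uint8_t plain[8]     = {0x50, 0x00, 0xde, 0xad, 0xbe, 0xef, 0x00, 0x01};
    uint8_t cipher[8];
    uint8_t decrypted[8];

    /* pretend bytes 0-1 are a little-endian field such as pd_upper */
    for (int i = 0; i < 8; i++)
        cipher[i] = plain[i] ^ keystream[i];

    /* the attacker nudges the field without knowing the key ... */
    cipher[0] ^= 0x0f;

    /* ... and decryption yields the field shifted by exactly that delta */
    for (int i = 0; i < 8; i++)
        decrypted[i] = cipher[i] ^ keystream[i];

    printf("field before: 0x%04x, after tampering: 0x%04x\n",
           (unsigned) (plain[0] | (plain[1] << 8)),
           (unsigned) (decrypted[0] | (decrypted[1] << 8)));
    return 0;
}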

As the patch stands that seems trivially doable, because as I read it,
most of the page header is not encrypted, and checksums are done of the
already encrypted data. But even if that weren't the case, brute forcing
16bit worth of checksum isn't too bad, even though it would obviously
make an attack a lot more noisy.

https://github.com/bmomjian/postgres/commit/7b43d37a5edb91c29ab6b4bb00def05def502c33#diff-0dcb5b2f36c573e2a7787994690b8fe585001591105f78e58ae3accec8f998e0R92
/*
 * Check if the page has a special size == GISTPageOpaqueData, a valid
 * GIST_PAGE_ID, no invalid GiST flag bits are set, and a valid LSN.  This
 * is true for all GiST pages, and perhaps a few pages that are not.  The
 * only downside of guessing wrong is that we might not update the LSN for
 * some non-permanent relation page changes, and therefore reuse the IV,
 * which seems acceptable.
 */

Huh?

Regards,

Andres




Re: Move pg_attribute.attcompression to earlier in struct for reduced size?

2021-05-26 Thread Alvaro Herrera
On 2021-May-26, Tom Lane wrote:

> Alvaro Herrera  writes:

> > It would be good to have a test case in src/bin/pg_dump/t/002_pg_dump.pl
> > for the case
> 
> Personally I won't touch 002_pg_dump.pl with a 10-foot pole, but if
> somebody else wants to, have at it.

Nod.

> > ... and I find it odd that we don't seem to have anything
> > for the "CREATE TABLE foo (LIKE sometab INCLUDING stuff)" form of the
> > command ... but neither of those seem the fault of this patch, and they
> > both work as [I think] is intended.
> 
> Hm, there's this in compression.sql:
> 
> -- test LIKE INCLUDING COMPRESSION
> CREATE TABLE cmdata2 (LIKE cmdata1 INCLUDING COMPRESSION);
> \d+ cmdata2
> 
> Or did you mean the case with a partitioned table specifically?

Ah, I guess that's sufficient.  (The INCLUDING clause cannot be used to
create a partition, actually.)

-- 
Álvaro Herrera39°49'30"S 73°17'W
"Now I have my system running, not a byte was off the shelf;
It rarely breaks and when it does I fix the code myself.
It's stable, clean and elegant, and lightning fast as well,
And it doesn't cost a nickel, so Bill Gates can go to hell."




Re: storing an explicit nonce

2021-05-26 Thread Andres Freund
Hi,

On 2021-05-25 22:23:46 -0400, Stephen Frost wrote:
> Andres mentioned other possible cases where the LSN doesn’t change even
> though we change the page and, as he’s probably right, we would have to
> figure out a solution in those cases too (potentially including cases like
> crash recovery or replay on a replica where we can’t really just go around
> creating dummy WAL records to get new LSNs..).

Yea, I think there's quite a few of those. For one, we don't guarantee
that the hole between pd_lower/upper is zeroes. It e.g. contains
old tuple data after deleted tuples are pruned away. But when logging an
FPI, we omit that range. Which means that after crash recovery the area
is zeroed out. There's several cases where padding can result in the
same.

Just look at checkXLogConsistency(), heap_mask() et al for all the
differences that can occur and that need to be ignored for the recovery
consistency checking to work.

Particularly the hole issue seems trivial to exploit, because we know
the plaintext of the hole after crash recovery (0s).


I don't see how using the LSN alone is salvagable.


Greetings,

Andres Freund




Re: storing an explicit nonce

2021-05-26 Thread Andres Freund
Hi,

On 2021-05-25 17:12:05 -0400, Bruce Momjian wrote:
> If we used a block cipher instead of a streaming one (CTR), this might
> not work because the earlier blocks can be based in the output of
> later blocks.

What made us choose CTR for WAL & data file encryption? I checked the
README in the patchset and the wiki page, and neither seem to discuss
that.

The dangers around nonce reuse, the space overhead of storing the nonce,
the fact that single bit changes in the encrypted data don't propagate
seem not great?  Why aren't we using something like XTS? It has obvious
issues as well, but CTR's weaknesses seem at least as great. And if we
want a MAC, then we don't want CTR either.

Greetings,

Andres Freund




Re: Move pg_attribute.attcompression to earlier in struct for reduced size?

2021-05-26 Thread Michael Paquier
On Wed, May 26, 2021 at 07:44:03PM -0400, Alvaro Herrera wrote:
> On 2021-May-26, Tom Lane wrote:
>> Personally I won't touch 002_pg_dump.pl with a 10-foot pole, but if
>> somebody else wants to, have at it.
> 
> Nod.

Yeah, having an extra test for partitioned tables would be a good
idea.

>> Hm, there's this in compression.sql:
>> 
>> -- test LIKE INCLUDING COMPRESSION
>> CREATE TABLE cmdata2 (LIKE cmdata1 INCLUDING COMPRESSION);
>> \d+ cmdata2
>> 
>> Or did you mean the case with a partitioned table specifically?
> 
> Ah, I guess that's sufficient.  (The INCLUDING clause cannot be used to
> create a partition, actually.)

+column_compression:
+   COMPRESSION ColId   { $$ = $2; }
+   | COMPRESSION DEFAULT   { $$ =
pstrdup("default"); }
Could it be possible to have some tests for COMPRESSION DEFAULT?  It
seems to me that this should be documented as a supported keyword for
CREATE/ALTER TABLE.

 --changing column storage should not impact the compression method
 --but the data should not be compressed
 ALTER TABLE cmdata2 ALTER COLUMN f1 TYPE varchar;
+ALTER TABLE cmdata2 ALTER COLUMN f1 SET COMPRESSION pglz;
This comment needs a refresh?
--
Michael


signature.asc
Description: PGP signature


Re: Move pg_attribute.attcompression to earlier in struct for reduced size?

2021-05-26 Thread Tom Lane
Michael Paquier  writes:
> Yeah, having an extra test for partitioned tables would be a good
> idea.

We do have some coverage already via the pg_upgrade test.

> Could it be possible to have some tests for COMPRESSION DEFAULT?  It
> seems to me that this should be documented as a supported keyword for
> CREATE/ALTER TABLE.

Uh, I did do both of those, no?  (The docs treat "default" as another
possible value, not a keyword, even though it's a keyword internally.)

>  --changing column storage should not impact the compression method
>  --but the data should not be compressed
>  ALTER TABLE cmdata2 ALTER COLUMN f1 TYPE varchar;
> +ALTER TABLE cmdata2 ALTER COLUMN f1 SET COMPRESSION pglz;
> This comment needs a refresh?

It's correct AFAICS.  Maybe it needs a bit of editing for clarity,
but I'm not sure how to make it better.  The point is that the
SET STORAGE just below disables compression of new values, no
matter what SET COMPRESSION says.

regards, tom lane




Re: Move pg_attribute.attcompression to earlier in struct for reduced size?

2021-05-26 Thread Andres Freund
Hi,

On 2021-05-25 14:46:27 +0900, Michael Paquier wrote:
> That would work.  Your suggestion, as I understood it first, makes the
> code simpler by not using tup_values at all as the set of values[] is
> filled when the values and nulls are extracted.  So I have gone with
> this simplification, and applied the patch (moved a bit the comments
> while on it).

Hm. memsetting values_free() to zero repeatedly isn't quite free, nor is
iterating over all columns one more time. Note that values/isnull are
passed in, and allocated with an accurate size, so it's a bit odd to
then do a pessimally sized stack allocation. Efficiency aside, that just
seems a bit weird?

The efficiency bit is probably going to be swamped by the addition of
the compression handling, given the amount of additional work we're now
doing in reform_and_rewrite_tuple(). I wonder if we should check how
much slower a VACUUM FULL of a table with a few varlena columns has
gotten vs 13.

Greetings,

Andres Freund




Re: Move pg_attribute.attcompression to earlier in struct for reduced size?

2021-05-26 Thread Tom Lane
Andres Freund  writes:
> The efficiency bit is probably going to be swamped by the addition of
> the compression handling, given the amount of additional work we're now
> doing in in reform_and_rewrite_tuple().

Only if the user has explicitly requested a change of compression, no?

regards, tom lane




Re: Race condition in recovery?

2021-05-26 Thread Kyotaro Horiguchi
At Wed, 26 May 2021 22:08:32 +0530, Dilip Kumar  wrote 
in 
> On Wed, 26 May 2021 at 10:06 PM, Robert Haas  wrote:
> 
> > On Wed, May 26, 2021 at 12:26 PM Dilip Kumar 
> > wrote:
> > > I will check if there is any timing dependency in the test case.
> >
> > There is. I explained it in the second part of my email, which you may
> > have failed to notice.
> 
> 
> Sorry, my bad.  I got your point now.  I will change the test.

I didn't notice that, but it is actually possible to happen.


By the way I'm having a hard time understanding what was happening on
this thread.

Very early in this thread I posted a test script that exactly
reproduces Dilip's case by starting from two standbys, based on his
explanation. But *we* didn't understand what the original commit
ee994272ca intended, and my understanding was that we wanted to find
that out.

So in the mails [1] and [2] I tried to describe what's going on around
the two issues.  Although I haven't had a response to [2], can I
assume that we clarified the intention of ee994272ca?  And may I assume
that we decided not to add a test for the commit?

Then it seems to me that Robert rediscovered how to reproduce Dilip's
case using a base backup instead of two standbys.  It uses a broken
archive_command with pg_basebackup -Xnone, and I showed that the same
resulting state can be reached by pg_basebackup -Xstream/fetch plus
clearing the pg_wal directory of the resulting backup, including an
explanation of why.

*I* think that it is better to avoid the archive_command, since just
unlinking some files seems simpler than having the broken
archive_command.  However, since Robert ignored that suggestion, I guess
he thinks that the broken archive_command is better.

Is my understanding above of the current status of this thread
right?


FWIW, regarding the name of the test script, putting aside what it
actually does, I proposed placing it as a part of
004_timeline_switch.pl because this issue is related to timeline
switching.


[1] 20210521.112105.27166595366072396.horikyota@gmail.com
[2] 
https://www.postgresql.org/message-id/20210524.113402.1922481024406047229.horikyota@gmail.com

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Move pg_attribute.attcompression to earlier in struct for reduced size?

2021-05-26 Thread Michael Paquier
On Wed, May 26, 2021 at 08:35:46PM -0400, Tom Lane wrote:
> Andres Freund  writes:
> > The efficiency bit is probably going to be swamped by the addition of
> > the compression handling, given the amount of additional work we're now
> > doing in in reform_and_rewrite_tuple().
> 
> Only if the user has explicitly requested a change of compression, no?

Andres' point is that we'd still initialize and run through
values_free at the end of reform_and_rewrite_tuple() for each tuple
even if there is no need to do so.  Well, we could control the
initialization and the free() checks at the end of the routine if we
know that there has been at least one detoasted value, at the expense
of making the code a bit less clear, of course.
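
A rough sketch of that idea, as a fragment only (the two helper names are
placeholders for the real detoast/recompress logic, not actual functions):

bool        any_value_freed = false;

for (int i = 0; i < newTupDesc->natts; i++)
{
    values_free[i] = false;
    if (!isnull[i] && value_needs_recompression(values[i]))    /* placeholder */
    {
        values[i] = recompress_value(values[i]);               /* placeholder */
        values_free[i] = true;
        any_value_freed = true;
    }
}

/* ... form and write the new tuple ... */

if (any_value_freed)
{
    for (int i = 0; i < newTupDesc->natts; i++)
    {
        if (values_free[i])
            pfree(DatumGetPointer(values[i]));
    }
}

That avoids both the per-tuple memset and the extra pass over all columns
in the common case where nothing was detoasted, at the cost mentioned
above of slightly less obvious code.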
--
Michael


signature.asc
Description: PGP signature


RE: [HACKERS] logical decoding of two-phase transactions

2021-05-26 Thread tanghy.f...@fujitsu.com
On Wed, May 26, 2021 10:13 PM Ajin Cherian  wrote:
> 
> I've attached a patch that fixes this issue. Do test and confirm.
> 

Thanks for your patch.
I have tested and confirmed that the issue I reported has been fixed.

Regards
Tang


Re: Speed up pg_checksums in cases where checksum already set

2021-05-26 Thread Julien Rouhaud
On Thu, May 27, 2021 at 5:24 AM Greg Sabino Mullane  wrote:
>
> The attached patch makes an optimization to pg_checksums which prevents 
> rewriting the block if the checksum is already what we expect. This can lead 
> to much faster runs in cases where it is already set (e.g. enabled -> 
> disabled -> enable, external helper process, interrupted runs, future 
> parallel processes). There is also an effort to not sync the data directory 
> if no changes were written. Finally, added a bit more output on how many 
> files were actually changed, e.g.:

I don't know how often this will actually help as probably people
aren't toggling the checksum state that often, but it seems like a
good idea overall.  The patch looks sensible to me.




RE: Parallel Inserts in CREATE TABLE AS

2021-05-26 Thread houzj.f...@fujitsu.com
From: Bharath Rupireddy 
Sent: Wednesday, May 26, 2021 7:22 PM
> Thanks for trying that out.
> 
> Please see the code around the use_fsm flag in RelationGetBufferForTuple for
> more understanding of the points below.
> 
> What happens if FSM is skipped i.e. myState->ti_options =
> TABLE_INSERT_SKIP_FSM;?
> 1) The flag use_fsm will be false in heap_insert->RelationGetBufferForTuple.
> 2) Each worker initially gets a block and keeps inserting into it until it is 
> full.
> When the block is full, the worker doesn't look in FSM GetPageWithFreeSpace
> as use_fsm is false. It directly goes for relation extension and tries to 
> acquire
> relation extension lock with LockRelationForExtension. Note that the bulk
> extension of blocks with RelationAddExtraBlocks is not reached as use_fsm is
> false.
> 3) After acquiring the relation extension lock, it adds an extra new block 
> with
> ReadBufferBI(relation, P_NEW, ...), see the comment "In addition to whatever
> extension we performed above, we always add at least one block to satisfy our
> own request." The tuple is inserted into this new block.
> 
> Basically, the workers can't look for the empty pages from the pages added by
> other workers, they keep doing the above steps in silos.
> 
> What happens if FSM is not skipped i.e. myState->ti_options = 0;?
> 1) The flag use_fsm will be true in heap_insert->RelationGetBufferForTuple.
> 2) Each worker initially gets a block and keeps inserting into it until it is 
> full.
> When the block is full, the worker looks for the page with free space in FSM
> GetPageWithFreeSpace as use_fsm is true.
> If it can't find any page with the required amount of free space, it goes for 
> bulk
> relation extension(RelationAddExtraBlocks) after acquiring relation extension
> lock with ConditionalLockRelationForExtension. Then the worker adds
> extraBlocks = Min(512, lockWaiters * 20); new blocks in
> RelationAddExtraBlocks and immediately updates the bottom level of FSM for
> each block (see the comment around RecordPageWithFreeSpace for why only
> the bottom level, not the entire FSM tree). After all the blocks are added, 
> then
> it updates the entire FSM tree FreeSpaceMapVacuumRange.
> 4) After the bulk extension, then the worker adds another block see the
> comment "In addition to whatever extension we performed above, we always
> add at least one block to satisfy our own request." and inserts tuple into 
> this
> new block.
> 
> Basically, the workers can benefit from the bulk extension of the relation and
> they always can look for the empty pages from the pages added by other
> workers. There are high chances that the blocks will be available after bulk
> extension. Having said that, if the added extra blocks are consumed by the
> workers so fast i.e. if the tuple sizes are big i.e very less tuples per 
> page, then,
> the bulk extension too can't help much and there will be more contention on
> the relation extension lock. Well, one might think to add more blocks at a 
> time,
> say Min(1024, lockWaiters * 128/256/512) than currently extraBlocks = Min(512,
> lockWaiters * 20);. This will work (i.e. we don't see any regression with 
> parallel
> inserts in CTAS patches), but it can't be a practical solution. Because the 
> total
> pages for the relation will be more with many pages having more free space.
> Furthermore, the future sequential scans on that relation might take a lot of
> time.
> 
> If myState->ti_options = TABLE_INSERT_SKIP_FSM; in only the place(within if
> (myState->is_parallel)), then it will be effective for leader i.e. leader 
> will not
> look for FSM, but all the workers will, because within if
> (myState->is_parallel_worker) in intorel_startup,
> myState->ti_options = 0; for workers.
> 
> I ran tests with configuration shown at [1] for the case 4 (2 bigint(of 8 
> bytes
> each) columns, 16 name(of 64 bytes each) columns, tuple size 1064 bytes, 10mn
> tuples) with leader participation where I'm seeing regression:
> 
> 1) when myState->ti_options = TABLE_INSERT_SKIP_FSM; for both leader and
> workers, then my results are as follows:
> 0 workers - 116934.137, 2 workers - 209802.060, 4 workers - 248580.275
> 2) when myState->ti_options = 0; for both leader and workers, then my results
> are as follows:
> 0 workers - 1116184.718, 2 workers - 139798.055, 4 workers - 143022.409
> I hope the above explanation and the test results should clarify the fact that
> skipping FSM doesn't solve the problem. Let me know if anything is not clear 
> or
> I'm missing something.

Thanks for the explanation.
I followed your above test steps and the below configuration, but my test 
results are a little different from yours.
I am not sure of the exact reason; maybe it is because of the hardware.

Test INSERT 1000 rows((2 bigint(of 8 bytes) 16 name(of 64 bytes each) 
columns):
SERIAL: 22023.631 ms
PARALLEL 2 WORKER [NOT SKIP FSM]: 21824.934 ms  [SKIP FSM]: 19381.474 ms
PARALLEL 4 WORKER [NOT SKIP FSM]: 20481.117 ms   [SKIP FSM]: 1

Re: Move pg_attribute.attcompression to earlier in struct for reduced size?

2021-05-26 Thread Andres Freund
Hi,

On 2021-05-26 20:35:46 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > The efficiency bit is probably going to be swamped by the addition of
> > the compression handling, given the amount of additional work we're now
> > doing in in reform_and_rewrite_tuple().
> 
> Only if the user has explicitly requested a change of compression, no?

Oh, it'll definitely be more expensive in that case - but that seems
fair game. What I was wondering about was whether VACUUM FULL would be
measurably slower, because we'll now call toast_get_compression_id() on
each varlena datum. It's pretty easy for VACUUM FULL to be CPU bound
already, and presumably this'll add a bit.

Greetings,

Andres Freund




Re: Move pg_attribute.attcompression to earlier in struct for reduced size?

2021-05-26 Thread Michael Paquier
On Wed, May 26, 2021 at 06:54:15PM -0700, Andres Freund wrote:
> Oh, it'll definitely be more expensive in that case - but that seems
> fair game. What I was wondering about was whether VACUUM FULL would be
> measurably slower, because we'll now call toast_get_compression_id() on
> each varlena datum. It's pretty easy for VACUUM FULL to be CPU bound
> already, and presumably this'll add a bit.

This depends on the number of attributes, but I do see an extra 0.5%
__memmove_avx_unaligned_erms in reform_and_rewrite_tuple() for a
normal VACUUM FULL with a 1-int-column relation on a perf profile,
with rewrite_heap_tuple eating most of it as in the past, so that's
within the noise bandwidth if you measure the runtime.  What would be
the worst case here, a table with one text column made of non-NULL
still very short values?
--
Michael


signature.asc
Description: PGP signature


Re: Move pg_attribute.attcompression to earlier in struct for reduced size?

2021-05-26 Thread Andres Freund
Hi,

On 2021-05-26 18:54:15 -0700, Andres Freund wrote:
> On 2021-05-26 20:35:46 -0400, Tom Lane wrote:
> > Andres Freund  writes:
> > > The efficiency bit is probably going to be swamped by the addition of
> > > the compression handling, given the amount of additional work we're now
> > > doing in in reform_and_rewrite_tuple().
> >
> > Only if the user has explicitly requested a change of compression, no?
>
> Oh, it'll definitely be more expensive in that case - but that seems
> fair game. What I was wondering about was whether VACUUM FULL would be
> measurably slower, because we'll now call toast_get_compression_id() on
> each varlena datum. It's pretty easy for VACUUM FULL to be CPU bound
> already, and presumably this'll add a bit.
>

CREATE UNLOGGED TABLE vacme_text(t01 text not null default 't01',t02 text not 
null default 't02',t03 text not null default 't03',t04 text not null default 
't04',t05 text not null default 't05',t06 text not null default 't06',t07 text 
not null default 't07',t08 text not null default 't08',t09 text not null 
default 't09',t10 text not null default 't10');
CREATE UNLOGGED TABLE vacme_int(i1 int not null default '1',i2 int not null 
default '2',i3 int not null default '3',i4 int not null default '4',i5 int not 
null default '5',i6 int not null default '6',i7 int not null default '7',i8 int 
not null default '8',i9 int not null default '9',i10 int not null default '10');
INSERT INTO vacme_text SELECT FROM generate_series(1, 1000);
INSERT INTO vacme_int SELECT FROM generate_series(1, 1000);

I ran 10 VACUUM FULLs on each, chose the shortest time:

unmodified
text:   3562ms
int:3037ms

after ifdefing out the compression handling:
text:   3175ms (x 0.88)
int:2894ms (x 0.95)

That's not *too* bad, but also not nothing

Greetings,

Andres Freund




Re: Add ZSON extension to /contrib/

2021-05-26 Thread Andrew Dunstan


On 5/26/21 5:29 PM, Bruce Momjian wrote:
> On Tue, May 25, 2021 at 01:55:13PM +0300, Aleksander Alekseev wrote:
>> Hi hackers,
>>
>> Back in 2016 while being at PostgresPro I developed the ZSON extension [1]. 
>> The
>> extension introduces the new ZSON type, which is 100% compatible with JSONB 
>> but
>> uses a shared dictionary of strings most frequently used in given JSONB
>> documents for compression. These strings are replaced with integer IDs.
>> Afterward, PGLZ (and now LZ4) applies if the document is large enough by 
>> common
>> PostgreSQL logic. Under certain conditions (many large documents), this saves
>> disk space, memory and increases the overall performance. More details can be
>> found in README on GitHub.
> I think this is interesting because it is one of the few cases that
> allow compression outside of a single column.  Here is a list of
> compression options:
>
>   https://momjian.us/main/blogs/pgblog/2020.html#April_27_2020
>   
>   1. single field
>   2. across rows in a single page
>   3. across rows in a single column
>   4. across all columns and rows in a table
>   5. across tables in a database
>   6. across databases
>
> While standard Postgres does #1, ZSON allows 2-5, assuming the data is
> in the ZSON data type.  I think this cross-field compression has great
> potential for cases where the data is not relational, or hasn't had time
> to be structured relationally.  It also opens questions of how to do
> this cleanly in a relational system.
>

I think we're going to get the best bang for the buck on doing 2, 3, and
4. If it's confined to a single table then we can put a dictionary in
something like a fork. Maybe given partitioning we want to be able to do
multi-table dictionaries, but that's less certain.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Speed up pg_checksums in cases where checksum already set

2021-05-26 Thread Michael Paquier
On Wed, May 26, 2021 at 05:23:55PM -0400, Greg Sabino Mullane wrote:
> The attached patch makes an optimization to pg_checksums which prevents
> rewriting the block if the checksum is already what we expect. This can
> lead to much faster runs in cases where it is already set (e.g. enabled ->
> disabled -> enable, external helper process, interrupted runs, future
> parallel processes).

Makes sense.
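
I guess the core of it is something like this inside scan_file()'s
per-block loop (a simplified sketch, not the actual patch; blocks_modified
is a hypothetical counter for the new output):

csum = pg_checksum_page(buf.data, blockno + segmentno * RELSEG_SIZE);

if (mode == PG_MODE_ENABLE)
{
    if (csum == header->pd_checksum)
        continue;               /* checksum already correct, skip the write */

    header->pd_checksum = csum;
    /* ... seek back and rewrite the block as before ... */
    blocks_modified++;
}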

> There is also an effort to not sync the data directory
> if no changes were written. Finally, added a bit more output on how many
> files were actually changed, e.g.:

-if (do_sync)
+if (do_sync && total_files_modified)
 {
 pg_log_info("syncing data directory");
 fsync_pgdata(DataDir, PG_VERSION_NUM);

Here, I am on the edge.  It could be an advantage to force a flush of
the data folder anyway, no?  Say, all the pages have a correct
checksum and they are in the OS cache, but they may not have been
flushed yet.  That would emulate what initdb -S does already.

> Checksum operation completed
> Files scanned:   1236
> Blocks scanned:  23283
> Files modified:  38
> Blocks modified: 19194
> pg_checksums: syncing data directory
> pg_checksums: updating control file
> Checksums enabled in cluster

The addition of the number of files modified looks like an advantage.
--
Michael




RE: Parallel Inserts in CREATE TABLE AS

2021-05-26 Thread tsunakawa.ta...@fujitsu.com
Thank you for the detailed analysis; I'll look into it too.  (The times
have changed...)

From: Bharath Rupireddy 
> Well, one might think to add more blocks at a time, say
> Min(1024, lockWaiters * 128/256/512) than currently extraBlocks =
> Min(512, lockWaiters * 20);. This will work (i.e. we don't see any
> regression with parallel inserts in CTAS patches), but it can't be a
> practical solution. Because the total pages for the relation will be
> more with many pages having more free space. Furthermore, the future
> sequential scans on that relation might take a lot of time.

> Otherwise, the similar speed up can be observed when the BAS_BULKWRITE
> is increased a bit from the current 16MB to some other reasonable
> value. I earlier tried these experiments.
> 
> Otherwise, as I said in [1], we can also increase the number of extra
> blocks added at a time, say Min(1024, lockWaiters * 128/256/512) than
> currently extraBlocks = Min(512, lockWaiters * 20);. This will also
> give some speedup and we don't see any regression with parallel
> inserts in CTAS patches.
> 
> But, I'm not so sure that the hackers will agree any of the above as a
> practical solution to the "relation extension" problem.

I think I understand your concern about resource consumption and the
impact on other concurrently running jobs (OLTP, data analysis).

OTOH, what about the situation where the user wants to run CTAS and,
further, wants to speed it up by using parallelism?  Isn't it okay to let
the (parallel) CTAS use as much as it wants?  At least, I think we can
provide another mode for it, like Oracle provides conventional path mode
and direct path mode for INSERT and data loading.

What would we want to do to maximize parallel CTAS speedup if we were a
bit unshackled from the current constraints (alignment with existing
code, impact on other concurrent workloads)?

* Use as many shared buffers as possible to decrease WAL flushes.
Otherwise, INSERT SELECT may end up being faster?

* Minimize relation extension (= increase the block count per extension).
posix_fallocate() would help here too.

* Allocate the added pages among the parallel workers, and have each
worker fill its pages to full capacity.
The worker that extended the relation stores the block numbers of the
added pages in shared memory for the parallel execution.  Each worker gets
a page from there after waiting for the relation extension lock, instead
of going through the FSM.
The last page each worker used will be filled only halfway, but the amount
of unused space should be low compared to the total table size.
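
To illustrate the last point, a very rough sketch of the shared structure
(all of these names are hypothetical; sizing, locking details and cleanup
are hand-waved):

/*
 * Pool of pre-extended pages handed out to parallel CTAS workers instead
 * of going through the FSM.
 */
typedef struct SharedExtendedPages
{
    slock_t     mutex;
    int         nused;          /* pages already handed out */
    int         nfilled;        /* pages currently in the array */
    BlockNumber pages[FLEXIBLE_ARRAY_MEMBER];
} SharedExtendedPages;

static BlockNumber
get_preallocated_page(SharedExtendedPages *sep)
{
    BlockNumber blkno = InvalidBlockNumber;

    SpinLockAcquire(&sep->mutex);
    if (sep->nused < sep->nfilled)
        blkno = sep->pages[sep->nused++];
    SpinLockRelease(&sep->mutex);

    /*
     * InvalidBlockNumber means the pool is empty: the caller takes the
     * relation extension lock, extends by a batch (posix_fallocate() could
     * help there), refills sep->pages, and retries.
     */
    return blkno;
}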



Regards
Takayuki Tsunakawa



Re: Move pg_attribute.attcompression to earlier in struct for reduced size?

2021-05-26 Thread Andres Freund
Hi,

On 2021-05-27 11:07:53 +0900, Michael Paquier wrote:
> This depends on the number of attributes, but I do see an extra 0.5%
> __memmove_avx_unaligned_erms in reform_and_rewrite_tuple() for a
> normal VACUUM FULL with a 1-int-column relation on a perf profile,
> with rewrite_heap_tuple eating most of it as in the past, so that's
> within the noise bandwidth if you measure the runtime.
> What would be the worst case here, a table with one text column made
> of non-NULL still very short values?

I think you need a bunch of columns to see it, like in the benchmark I
just posted - I didn't test any other number of columns than 10 though.

Greetings,

Andres Freund




Re: Speed up pg_checksums in cases where checksum already set

2021-05-26 Thread Justin Pryzby
In one of the checksum patches, there was an understanding that the pages
should be written even if the checksum is correct, to handle replicas.

From the v19 patch:
https://www.postgresql.org/message-id/F7AFCFCD-8F77-4546-8D42-C7F675A4B680%40yesql.se
+        * Mark the buffer as dirty and force a full page write.  We have to
+        * re-write the page to WAL even if the checksum hasn't changed,
+        * because if there is a replica it might have a slightly different
+        * version of the page with an invalid checksum, caused by unlogged
+        * changes (e.g. hintbits) on the master happening while checksums
+        * were off. This can happen if there was a valid checksum on the page
+        * at one point in the past, so only when checksums are first on, then
+        * off, and then turned on again.

pg_checksums(1) says:

|   When using a replication setup with tools which perform direct copies of
|   relation file blocks (for example pg_rewind(1)), enabling or disabling
|   checksums can lead to page corruptions in the shape of incorrect checksums
|   if the operation is not done consistently across all nodes. When enabling
|   or disabling checksums in a replication setup, it is thus recommended to
|   stop all the clusters before switching them all consistently. Destroying
|   all standbys, performing the operation on the primary and finally
|   recreating the standbys from scratch is also safe.

Does your patch complicate things for the "stop all the clusters before
switching them all" case?

-- 
Justin




Re: Remaining references to RecentGlobalXmin

2021-05-26 Thread Andres Freund
Hi,

On 2021-05-24 15:47:48 +0900, Michael Paquier wrote:
> dc7420c2 has removed RecentGlobalXmin, but there are still references
> to it in the code, and a set of FIXME references, like this one in
> autovacuum.c (three in total):
> /*
>  * Start a transaction so we can access pg_database, and get a snapshot.
>  * We don't have a use for the snapshot itself, but we're interested in
>  * the secondary effect that it sets RecentGlobalXmin.  (This is critical
>  * for anything that reads heap pages, because HOT may decide to prune
>  * them even if the process doesn't attempt to modify any tuples.)
>  *
>  * FIXME: This comment is inaccurate / the code buggy. A snapshot that is
>  * not pushed/active does not reliably prevent HOT pruning (->xmin could
>  * e.g. be cleared when cache invalidations are processed).
>  */
> 
> Wouldn't it be better to clean up that?

Sure, but the real cleanup needed isn't to remove the reference to
RecentGlobalXmin, nor is it specific to 14. It's that the code isn't
right, and hasn't been for a long time.
https://www.postgresql.org/message-id/20200407072418.ccvnyjbrktyi3rzc%40alap3.anarazel.de

Greetings,

Andres Freund




Re: [HACKERS] logical decoding of two-phase transactions

2021-05-26 Thread Ajin Cherian
On Thu, May 27, 2021 at 11:20 AM tanghy.f...@fujitsu.com
 wrote:
>
> On Wed, May 26, 2021 10:13 PM Ajin Cherian  wrote:
> >
> > I've attached a patch that fixes this issue. Do test and confirm.
> >
>
> Thanks for your patch.
> I have tested and confirmed that the issue I reported has been fixed.

Thanks for the confirmation. The problem was, as you reported, a table
left open when a transaction was committed.
This happened because UpdateTwoPhaseState() was committing a transaction
internally while its caller, CreateSubscription(), still had a table open.
The call was newly added to the CreateSubscription() code to handle the
new case of two_phase being enabled at CREATE SUBSCRIPTION time when
"copy_data = false". CreateSubscription() doesn't need this to run inside
its own transaction, and the internal commit was only meant for the place
the function was originally written for, the apply worker code
(ApplyWorkerMain()).
So, I removed the commit from inside UpdateTwoPhaseState() and instead
start and commit the transaction around the call in the apply worker code.
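
In other words, the apply worker side now does roughly this (a sketch;
UpdateTwoPhaseState() and the state constant come from the patch set, so
treat the exact names as assumptions):

/* In the apply worker, around the call that used to commit internally. */
StartTransactionCommand();
UpdateTwoPhaseState(MySubscription->oid, LOGICALREP_TWOPHASE_STATE_ENABLED);
CommitTransactionCommand();

while CreateSubscription() keeps calling UpdateTwoPhaseState() inside its
own, already-open transaction.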

regards,
Ajin Cherian
Fujitsu Australia




Re: Move pg_attribute.attcompression to earlier in struct for reduced size?

2021-05-26 Thread Tom Lane
Andres Freund  writes:
> That's not *too* bad, but also not nothing

The memsets seem to be easy to get rid of.  memset the array
to zeroes *once* before entering the per-tuple loop.  Then,
in the loop that looks for stuff to pfree, reset any entries
that are found to be set, thereby returning the array to all
zeroes for the next iteration.

I'm having a hard time though believing that the memset is the
main problem.  I'd think the pfree search loop is at least as
expensive.  Maybe skip that when not useful, by having a single
bool flag remembering whether any columns got detoasted in this
row?
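
Something like this, I suppose (illustrative sketch; fetch_next_tuple(),
needs_recompression() and recompress_datum() are stand-ins for the real
reform_and_rewrite_tuple() context, not actual functions):

Datum      *values = palloc(natts * sizeof(Datum));
bool       *isnull = palloc(natts * sizeof(bool));
bool       *values_free = palloc0(natts * sizeof(bool));    /* zeroed once */

while ((tuple = fetch_next_tuple()) != NULL)
{
    bool        any_freeable = false;

    heap_deform_tuple(tuple, oldTupDesc, values, isnull);

    for (int i = 0; i < natts; i++)
    {
        if (!isnull[i] && needs_recompression(values[i]))
        {
            values[i] = recompress_datum(values[i]);
            values_free[i] = true;
            any_freeable = true;
        }
    }

    /* ... form and write the new tuple from values/isnull ... */

    /* Skip the search entirely when nothing was detoasted in this row. */
    if (any_freeable)
    {
        for (int i = 0; i < natts; i++)
        {
            if (values_free[i])
            {
                pfree(DatumGetPointer(values[i]));
                values_free[i] = false;     /* back to the all-zeroes state */
            }
        }
    }
}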

regards, tom lane




RE: Skip partition tuple routing with constant partition key

2021-05-26 Thread houzj.f...@fujitsu.com
Hi amit-san

From: Amit Langote 
Sent: Wednesday, May 26, 2021 9:38 AM
> 
> Hou-san,
> 
> On Wed, May 26, 2021 at 10:05 AM houzj.f...@fujitsu.com
>  wrote:
> >
> > Thanks for the explanation !
> > Yeah, we can get all the parent table info from PartitionTupleRouting when
> INSERT into a partitioned table.
> >
> > But I have two issues about using the information from PartitionTupleRouting
> to get the parent table's key expression:
> > 1) It seems we do not initialize the PartitionTupleRouting when directly
> INSERT into a partition(not a partitioned table).
> > I think it will be better we let the pre-compute-key_expression
> > feature to be used in all the possible cases, because it could bring nice
> performance improvement.
> >
> > 2) When INSERT into a partitioned table which is also a partition, the
> PartitionTupleRouting is initialized after the ExecPartitionCheck.
> 
> Hmm, do we really need to optimize ExecPartitionCheck() when partitions are
> directly inserted into?  As also came up earlier in the thread, we want to
> discourage users from doing that to begin with, so it doesn't make much sense
> to spend our effort on that case.
> 
> Optimizing ExecPartitionCheck(), specifically your idea of pre-computing the
> partition key expressions, only came up after finding that the earlier patch 
> to
> teach ExecFindPartition() to cache partitions may benefit from it.  IOW,
> optimizing ExecPartitionCheck() for its own sake does not seem worthwhile,
> especially not if we'd need to break module boundaries to make that happen.
> 
> Thoughts?

OK, I see the point, thanks for the explanation.
Let's try to move forward.

About teaching the relcache to cache the target partition:

David-san suggested caching the partidx in PartitionDesc, which would
need looping over and checking the cached value at each level.
I was thinking: could we instead cache a partidx list (e.g. [1, 2, 3]),
follow that list straight to the last partition, and run the partition
CHECK only on that final partition?  If anything unexpected happens, we
return to the original table and redo the tuple routing without using
the cached indexes.
What do you think?
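
To illustrate, a rough sketch (all of these names are hypothetical, just
to show the cached partidx-list idea):

#define MAX_CACHED_ROUTE_DEPTH  8       /* arbitrary, for illustration */

typedef struct CachedRoute
{
    int         nsteps;                          /* levels cached so far */
    int         partidx[MAX_CACHED_ROUTE_DEPTH]; /* index chosen per level */
} CachedRoute;

/*
 * Try the cached route first: walk straight to the leaf and run
 * ExecPartitionCheck() only on that leaf.  On failure, invalidate the
 * cache and let the caller redo normal tuple routing from the root.
 */
static ResultRelInfo *
route_using_cache(PartitionTupleRouting *proute, CachedRoute *cache,
                  EState *estate, TupleTableSlot *slot)
{
    ResultRelInfo *leaf = follow_cached_indexes(proute, cache); /* hypothetical */

    if (leaf != NULL && ExecPartitionCheck(leaf, slot, estate, false))
        return leaf;

    cache->nsteps = 0;          /* forget the stale route */
    return NULL;                /* caller falls back to ExecFindPartition() */
}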

Best regards,
houzj
 




Re: Decoding speculative insert with toast leaks memory

2021-05-26 Thread Peter Geoghegan
On Wed, Mar 24, 2021 at 10:34 PM Ashutosh Bapat
 wrote:
> Hi All,
> We saw OOM in a system where WAL sender consumed Gigabytes of memory
> which was never released. Upon investigation, we found out that there
> were many ReorderBufferToastHash memory contexts linked to
> ReorderBuffer context, together consuming gigs of memory. They were
> running INSERT ... ON CONFLICT .. among other things. A similar report
> at [1]

What is the relationship between this bug and commit 7259736a6e5, which
dealt specifically with TOAST and speculative insertion resource
management issues within reorderbuffer.c?  Amit?

-- 
Peter Geoghegan




RE: Fdw batch insert error out when set batch_size > 65535

2021-05-26 Thread houzj.f...@fujitsu.com
From: Bharath Rupireddy 
Sent: Wednesday, May 26, 2021 9:56 PM
> On Wed, May 26, 2021 at 6:36 PM Tomas Vondra
>  wrote:
> >
> > On 5/26/21 8:57 AM, Bharath Rupireddy wrote:
> > > On Tue, May 25, 2021 at 2:47 PM Bharath Rupireddy
> > >  wrote:
> > >>
> > >> On Tue, May 25, 2021 at 1:08 PM houzj.f...@fujitsu.com
> > >>  wrote:
> > >>> Thanks for the comments. I have addressed all comments to the v3
> patch.
> > >>
> > >> Thanks! The patch basically looks good to me except that it is
> > >> missing a commit message. I think it can be added now.
> > >
> > > With v3 patch, I observed failure in postgres_fdw test cases with
> > > insert query in prepared statements. Root cause is that in
> > > postgresGetForeignModifyBatchSize, fmstate can be null (see the
> > > existing code which handles for fmstate null cases). I fixed this,
> > > and added a commit message. PSA v4 patch.
> > >
> >
> > Thanks. In what situation is the fmstate NULL? If it is NULL, the
> > current code simply skips the line adjusting it. Doesn't that mean we
> > may not actually fix the bug in that case?
> 
> fmstate i.e. resultRelInfo->ri_FdwState is NULL for EXPLAIN without ANALYZE
> cases, below comment says it and we can't get the bug because we don't
> actually execute the insert statement. The bug occurs on the remote server
> when the insert query with those many query parameters is submitted to the
> remote server.

Agreed.
ri_FdwState is initialized in postgresBeginForeignInsert() or
postgresBeginForeignModify(), and I think those functions are always
invoked before the batch_size is fetched.

Only in EXPLAIN (without ANALYZE) mode is ri_FdwState left uninitialized:

/*
 * Do nothing in EXPLAIN (no ANALYZE) case.  resultRelInfo->ri_FdwState
 * stays NULL.
 */
if (eflags & EXEC_FLAG_EXPLAIN_ONLY)
return;
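
So the clamp in postgresGetForeignModifyBatchSize() just needs to be
NULL-safe, roughly like this (simplified sketch; 65535 is the libpq limit
on the number of bind parameters per query):

if (fmstate && fmstate->p_nums > 0)
    batch_size = Min(batch_size, 65535 / fmstate->p_nums);

return batch_size;

When fmstate is NULL (EXPLAIN without ANALYZE) no remote query is
executed, so returning the configured batch_size unchanged is fine.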

Best regards,
houzj
 



Re: Move pg_attribute.attcompression to earlier in struct for reduced size?

2021-05-26 Thread Andres Freund
Hi,

On 2021-05-26 22:43:42 -0400, Tom Lane wrote:
> The memsets seem to be easy to get rid of.  memset the array
> to zeroes *once* before entering the per-tuple loop.  Then,
> in the loop that looks for stuff to pfree, reset any entries
> that are found to be set, thereby returning the array to all
> zeroes for the next iteration.

> I'm having a hard time though believing that the memset is the
> main problem.  I'd think the pfree search loop is at least as
> expensive.  Maybe skip that when not useful, by having a single
> bool flag remembering whether any columns got detoasted in this
> row?

Yea, I tested that - it does help in the integer case. But the bigger
contributors are the loop over the attributes, and especially the access
to the datum's compression method. Particularly the latter seems hard to
avoid.

Greetings,

Andres Freund




Re: Decoding speculative insert with toast leaks memory

2021-05-26 Thread Amit Kapila
On Thu, Mar 25, 2021 at 11:04 AM Ashutosh Bapat
 wrote:
>
> Hi All,
> We saw OOM in a system where WAL sender consumed Gigabytes of memory
> which was never released. Upon investigation, we found out that there
> were many ReorderBufferToastHash memory contexts linked to
> ReorderBuffer context, together consuming gigs of memory. They were
> running INSERT ... ON CONFLICT .. among other things. A similar report
> at [1]
>
..
>
> but by then we might have reused the toast_hash and thus can not be
> destroyed. But that isn't the problem since the reused toast_hash will
> be destroyed eventually.
>
> It's only when the change next to speculative insert is something
> other than INSERT/UPDATE/DELETE that we have to worry about a
> speculative insert that was never confirmed. So may be for those
> cases, we check whether specinsert != null and destroy toast_hash if
> it exists.
>

Can we consider destroying the toast_hash in
ReorderBufferCleanupTXN/ReorderBufferTruncateTXN?  That would delay the
cleanup of the memory till the end of the stream or txn, but there won't
be any memory leak.
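
Something along these lines at the start of those functions (a sketch;
ReorderBufferToastReset() is the existing helper in reorderbuffer.c that
destroys the hash, but the exact placement needs checking):

/*
 * Free any toast reconstruction state left behind, e.g. by a speculative
 * insertion that was never confirmed.
 */
if (txn->toast_hash != NULL)
    ReorderBufferToastReset(rb, txn);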

-- 
With Regards,
Amit Kapila.




Re: Decoding speculative insert with toast leaks memory

2021-05-26 Thread Amit Kapila
On Thu, May 27, 2021 at 8:27 AM Peter Geoghegan  wrote:
>
> On Wed, Mar 24, 2021 at 10:34 PM Ashutosh Bapat
>  wrote:
> > Hi All,
> > We saw OOM in a system where WAL sender consumed Gigabytes of memory
> > which was never released. Upon investigation, we found out that there
> > were many ReorderBufferToastHash memory contexts linked to
> > ReorderBuffer context, together consuming gigs of memory. They were
> > running INSERT ... ON CONFLICT .. among other things. A similar report
> > at [1]
>
> What is the relationship between this bug and commit 7259736a6e5, which
> dealt specifically with TOAST and speculative insertion resource
> management issues within reorderbuffer.c?  Amit?
>

This seems to be a pre-existing bug. It should be reproducible in PG-13
and/or prior to that commit. Ashutosh, can you confirm?

-- 
With Regards,
Amit Kapila.




Re: Move pg_attribute.attcompression to earlier in struct for reduced size?

2021-05-26 Thread Tom Lane
Andres Freund  writes:
> Yea, I tested that - it does help in the integer case. But the bigger
> contributors are the loop over the attributes, and especially the access
> to the datum's compression method. Particularly the latter seems hard to
> avoid.

So maybe we should just dump the promise that VACUUM FULL will recompress
everything?  I'd be in favor of that actually, because it seems 100%
outside the charter of either VACUUM FULL or CLUSTER.

regards, tom lane




Re: Parallel Inserts in CREATE TABLE AS

2021-05-26 Thread Bharath Rupireddy
On Thu, May 27, 2021 at 7:12 AM houzj.f...@fujitsu.com
 wrote:
> I followed your above test steps and the below configuration, but my test 
> results are a little different from yours.
> I am not sure the exact reason, maybe because of the hardware..
>
> Test INSERT 1000 rows((2 bigint(of 8 bytes) 16 name(of 64 bytes each) 
> columns):
> SERIAL: 22023.631 ms
> PARALLEL 2 WORKER [NOT SKIP FSM]: 21824.934 ms  [SKIP FSM]: 19381.474 ms
> PARALLEL 4 WORKER [NOT SKIP FSM]: 20481.117 ms   [SKIP FSM]: 18381.305 ms

I'm not sure why there's a huge difference in the execution time: on
your system it takes just ~20 sec, whereas on my system (with an SSD) it
takes ~115 sec. You didn't create the target table as unlogged in the
CTAS, right? Just for reference, the exact use case I tried is at [1].
The configure command I used to build the postgres source code is at
[2]. I don't know whether I'm missing something here.

[1] case 4 - 2 bigint(of 8 bytes each) columns, 16 name(of 64 bytes
each) columns, tuple size 1064 bytes, 10mn tuples
DROP TABLE tenk1;
CREATE UNLOGGED TABLE tenk1(c1 bigint, c2 bigint, c3 name, c4 name, c5
name, c6 name, c7 name, c8 name, c9 name, c10 name, c11 name, c12
name, c13 name, c14 name, c15 name, c16 name, c17 name, c18 name);
INSERT INTO tenk1 values(generate_series(1,1000),
generate_series(1,1000),
upper(substring(md5(random()::varchar),2,8)),
upper(substring(md5(random()::varchar),2,8)),
upper(substring(md5(random()::varchar),2,8)),
upper(substring(md5(random()::varchar),2,8)),
upper(substring(md5(random()::varchar),2,8)),
upper(substring(md5(random()::varchar),2,8)),
upper(substring(md5(random()::varchar),2,8)),
upper(substring(md5(random()::varchar),2,8)),
upper(substring(md5(random()::varchar),2,8)),
upper(substring(md5(random()::varchar),2,8)),
upper(substring(md5(random()::varchar),2,8)),
upper(substring(md5(random()::varchar),2,8)),
upper(substring(md5(random()::varchar),2,8)),
upper(substring(md5(random()::varchar),2,8)),
upper(substring(md5(random()::varchar),2,8)),
upper(substring(md5(random()::varchar),2,8)));
explain analyze verbose create table test as select * from tenk1;

[2] ./configure --with-zlib --prefix=$PWD/inst/ --with-openssl
--with-readline  --with-libxml  > war.log && make -j 8 install >
war.log 2>&1 &

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com



