Re: locking [user] catalog tables vs 2pc vs logical rep

2021-05-25 Thread Michael Paquier
On Mon, May 24, 2021 at 10:03:01AM +0530, Amit Kapila wrote:
> So, this appears to be an existing caveat of synchronous replication.
> If that is the case, I am not sure if it is a good idea to just block
> such ops for the prepared transaction. Also, what about other
> operations which acquire an exclusive lock on [user]_catalog_tables
> like:
> cluster pg_trigger using pg_class_oid_index, similarly cluster on any
> user_catalog_table, then the other problematic operation could
> truncate of user_catalog_table as is discussed in another thread [1].
> I think all such operations can block even with synchronous
> replication. I am not sure if we can create examples for all cases
> because for ex. we don't have use of user_catalog_tables in-core but
> maybe for others, we can try to create examples and see what happens?
> 
> If all such operations can block for synchronous replication and
> prepared transactions replication then we might want to document them
> as caveats at page:
> https://www.postgresql.org/docs/devel/logicaldecoding-synchronous.html
> and then also give the reference for these caveats at prepared
> transactions 
> page:https://www.postgresql.org/docs/devel/logicaldecoding-two-phase-commits.html
> 
> What do you think?

It seems to me that the 2PC issues on catalog tables and the issues
related to logical replication in synchronous mode are two distinct
things that need to be fixed separately.

The issue with a LOCK taken on a catalog while a PREPARE TRANSACTION
holds locks around is bad enough in itself, as it could lock a user out
of a cluster for as long as the PREPARE TRANSACTION is not removed from
WAL (say the relation is critical for connection startup).  This can be
really disruptive even if the user only tried to take a lock on an
object they own, and recovery is not easy here: it involves either an
old backup or, worse, pg_resetwal.
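
For instance, a minimal sketch of the problematic sequence (assuming
max_prepared_transactions > 0; the GID is arbitrary):

```
BEGIN;
LOCK TABLE pg_class IN ACCESS EXCLUSIVE MODE;
PREPARE TRANSACTION 'locks_pg_class';
-- The AccessExclusiveLock on pg_class is kept by the prepared
-- transaction, even across a restart, so new sessions can be blocked
-- until someone runs COMMIT PREPARED 'locks_pg_class' or
-- ROLLBACK PREPARED 'locks_pg_class'.
```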

The second issue with logical replication is still disruptive, but it
looks to me more like a don't-do-it issue, and documenting the caveats
sounds fine enough.

Looking at the patch from upthread..

+   /*
+* Make note that we've locked a system table or a user catalog
+* table. This flag will be checked later during prepare transaction
+* to fail the prepare transaction.
+*/
+   if (lockstmt->mode >= ExclusiveLock &&
+   (IsCatalogRelationOid(reloid) ||
+RelationIsUsedAsCatalogTable(rel)))
+   MyXactFlags |= XACT_FLAGS_ACQUIREDEXCLUSIVELOCK_SYSREL;
I think that I'd just use IsCatalogRelationOid() here, and I'd be more
severe and restrict all attempts for any lock levels.  It seems to me
that this needs to happen within RangeVarCallbackForLockTable().
I would also rename the flag as just XACT_FLAGS_LOCKEDCATALOG.

+errmsg("cannot PREPARE a transaction that has an exclusive 
lock on user catalog/system table(s)")));
What about "cannot PREPARE a transaction that has locked a catalog
relation"?
--
Michael




Re: Assertion failure while streaming toasted data

2021-05-25 Thread Michael Paquier
On Tue, May 25, 2021 at 12:06:38PM +0530, Pavan Deolasee wrote:
> While working on an output plugin that uses streaming protocol, I hit an
> assertion failure. Further investigations revealed a possible bug in core
> Postgres. This must be new to PG14 since streaming support is new to this
> release. I extended the test_decoding regression test to demonstrate the
> failure. PFA

Thanks, Pavan.  I have added an open item for this one.
--
Michael




Re: Different compression methods for FPI

2021-05-25 Thread Michael Paquier
On Mon, May 24, 2021 at 11:44:45PM -0500, Justin Pryzby wrote:
> The goal is to support 2+ "methods" (including "none"), which takes 4 bits, so
> may as well support 3 methods.
> 
> - uncompressed
> - pglz
> - lz4
> - zlib or zstd or ??

Let's make a proper study of all that before making a choice; the only
thing I am rather sure of is that pglz is bad compared to all the
others.  There is no point arguing as long as we don't know whether any
of those candidates are suited for the job.

> This version:
> 0) repurposes the pre-existing GUC as an enum;
> 1) saves a bit (until zstd is included);
> 2) shows the compression in pg_waldump;
> 
> To support different compression levels, I think I'd change from an enum to
> string and an assign hook, which sets a pair of ints.

Hmm.  I am not really sure what you mean here, but let's keep that
in mind until we get more performance numbers.
--
Michael




RE: Fdw batch insert error out when set batch_size > 65535

2021-05-25 Thread houzj.f...@fujitsu.com
From: Bharath Rupireddy 
Sent: Friday, May 21, 2021 5:03 PM
> On Fri, May 21, 2021 at 1:19 PM houzj.f...@fujitsu.com
>  wrote:
> > Attaching V2 patch. Please consider it for further review.
> 
> Thanks for the patch. Some more comments:
> 
> 1) Can fmstate->p_nums ever be 0 in postgresGetForeignModifyBatchSize?
> By any chance, if it can, I think instead of an assert, we can have something 
> like
> below, since we are using it in the division.
> if (fmstate->p_nums > 0 &&
>(batch_size * fmstate->p_nums > PQ_QUERY_PARAM_MAX_LIMIT))
> {
> batch_size = PQ_QUERY_PARAM_MAX_LIMIT / fmstate->p_nums;
> }
> Also, please remove the { and } for the above if condition, because for 1 line
> statements we don't normally use { and }
> 2) An empty line after the macro definition will be good.
> +#define PQ_QUERY_PARAM_MAX_LIMIT 65535
>  extern intPQsendQuery(PGconn *conn, const char *query);
> 3) I think the docs should use consistent markup for postgres_fdw,
> batch_size, and the number of columns * batch_size where they appear
> in the paragraph below:
> +   overrides an option specified for the server. Note the final number
> +   of rows postgres_fdw inserts in a batch actually depends on the
> +   number of columns and the provided batch_size value. This is because
> +   of the limit the libpq protocol (which postgres_fdw uses to connect
> +   to a remote server) has on the number of query parameters that can
> +   be specified per query. For instance, if the number of columns
> * batch_size
> +   is more than the limit, then the libpq emits an error. But 
> postgres_fdw
> +   adjusts the batch_size to avoid this error.

Thanks for the comments. I have addressed all of them in the v3 patch.
BTW, is the batch_size issue here an Open Item for PG14?
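
For reference, a quick sketch of where the option under discussion is
set (server and table names here are made up):

```
CREATE SERVER loopback FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'localhost', dbname 'postgres', batch_size '100000');
CREATE FOREIGN TABLE ft1 (a int, b text)
    SERVER loopback OPTIONS (table_name 't1', batch_size '100000');
-- With many columns, 100000 rows per batch can exceed libpq's 65535
-- query-parameter limit, which is the failure the patch guards against.
```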

Best regards,
houzj


v3-0001-limit-the-fdw-batch-size.patch
Description: v3-0001-limit-the-fdw-batch-size.patch


RE: Parallel Inserts in CREATE TABLE AS

2021-05-25 Thread tsunakawa.ta...@fujitsu.com
Bharath-san, all,


Hmm, I didn't experience performance degradation on my poor-man's Linux VM (4 
CPU, 4 GB RAM, HDD)...

[benchmark preparation]
autovacuum = off
shared_buffers = 1GB
checkpoint_timeout = 1h
max_wal_size = 8GB
min_wal_size = 8GB
(other settings to enable parallelism)
CREATE UNLOGGED TABLE a (c char(1100));
INSERT INTO a SELECT i FROM generate_series(1, 30) i;
(the table size is 335 MB)

[benchmark]
CREATE TABLE b AS SELECT * FROM a;
DROP TABLE a;
CHECKPOINT;
(measure only CTAS)


[results]
parallel_leader_participation = off
  workers  time(ms)
  0  3921
  2  3290
  4  3132
parallel_leader_participation = on
  workers  time(ms)
  2  3266
  4  3247


Although this may be a controversial and perhaps crazy idea, the 
following change brought a 4-11% speedup.  My thinking is that parallel 
workers might contend for WAL flushes as a result of using the limited 
ring buffer and flushing dirty buffers whenever the ring buffer fills 
up.  Can we take advantage of this?

[GetBulkInsertState]
/*  bistate->strategy = GetAccessStrategy(BAS_BULKWRITE);*/
bistate->strategy = NULL;


[results]
parallel_leader_participation = off
  workers  time(ms)
  0  3695  (5% reduction)
  2  3135  (4% reduction)
  4  2767  (11% reduction)


Regards
Takayuki Tsunakawa



Re: Assertion failure while streaming toasted data

2021-05-25 Thread Dilip Kumar
On Tue, May 25, 2021 at 12:12 PM Dilip Kumar  wrote:
>
> On Tue, May 25, 2021 at 12:06 PM Pavan Deolasee
>  wrote:
> >
> > Hi,
> >
> > While working on an output plugin that uses streaming protocol, I hit an 
> > assertion failure. Further investigations revealed a possible bug in core 
> > Postgres. This must be new to PG14 since streaming support is new to this 
> > release. I extended the test_decoding regression test to demonstrate the 
> > failure. PFA
> >
> > ```
> > 2021-05-25 11:32:19.493 IST client backend[68321] pg_regress/stream 
> > STATEMENT:  SELECT data FROM pg_logical_slot_get_changes('regression_slot', 
> > NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '
> > 1', 'stream-changes', '1');
> > TRAP: FailedAssertion("txn->size == 0", File: "reorderbuffer.c", Line: 
> > 3476, PID: 68321)
> > ```
> >
> > From my preliminary analysis, it looks like we fail to adjust the memory 
> > accounting after streaming toasted tuples. More concretely, after 
> > `ReorderBufferProcessPartialChange()` processes the in-progress 
> > transaction,  `ReorderBufferTruncateTXN()` truncates the accumulated 
> changes in the transaction, but fails to adjust the buffer size for toast 
> > chunks. Maybe we are missing a call to `ReorderBufferToastReset()` 
> > somewhere?
> >
> > From what I see, the assertion only triggers when data is inserted via COPY 
> > (multi-insert).
> >
> > Let me know if anything else is needed to reproduce this.
>
> Thanks, I will look into this and let you know if need some help.

I have identified the cause of the issue.  Basically, if we are doing a
multi-insert operation we don't set the toast cleanup until we get the
last tuple of the xl_multi_insert [1].  Now, with streaming, we can
process the transaction in the middle of the multi-insert, but in that
case "change->data.tp.clear_toast_afterwards" is false for all the
tuples in this stream, and because of that we never clean up the toast.

One simple fix could be to just clean the toast memory if we are
processing in streaming mode (as shown in the attached patch).  But
maybe that is not the best-optimized solution; ideally, we would keep
the toast until we process the last tuple of the multi-insert in the
current stream, but I think that's not an easy thing to identify.

[1]
/*
* Reset toast reassembly state only after the last row in the last
* xl_multi_insert_tuple record emitted by one heap_multi_insert()
* call.
*/
if (xlrec->flags & XLH_INSERT_LAST_IN_MULTI &&
(i + 1) == xlrec->ntuples)
change->data.tp.clear_toast_afterwards = true;
else
change->data.tp.clear_toast_afterwards = false;

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index b0ab91c..76ecfc0 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -2183,7 +2183,7 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		 * they're not required anymore. The creator of the
 		 * tuple tells us.
 		 */
-		if (change->data.tp.clear_toast_afterwards)
+		if (change->data.tp.clear_toast_afterwards || streaming)
 			ReorderBufferToastReset(rb, txn);
 	}
 	/* we're not interested in toast deletions */


RE: locking [user] catalog tables vs 2pc vs logical rep

2021-05-25 Thread osumi.takami...@fujitsu.com
On Monday, May 24, 2021 1:33 PM Amit Kapila  wrote:
> On Tue, Apr 20, 2021 at 9:57 AM vignesh C  wrote:
> >
> > This similar problem exists in case of synchronous replication setup
> > having synchronous_standby_names referring to the subscriber, when we
> > do the steps "begin;lock pg_class; insert into test1 values(10);
> > commit". In this case while decoding of commit, the commit will wait
> > while trying to acquire a lock on pg_class relation,
> >
> 
> So, this appears to be an existing caveat of synchronous replication.
> If that is the case, I am not sure if it is a good idea to just block such 
> ops for the
> prepared transaction. Also, what about other operations which acquire an
> exclusive lock on [user]_catalog_tables
> like:
> cluster pg_trigger using pg_class_oid_index, similarly cluster on any
> user_catalog_table, then the other problematic operation could truncate of
> user_catalog_table as is discussed in another thread [1].
> I think all such operations can block even with synchronous replication. I am 
> not
> sure if we can create examples for all cases because for ex. we don't have use
> of user_catalog_tables in-core but maybe for others, we can try to create
> examples and see what happens?
> 
> If all such operations can block for synchronous replication and prepared
> transactions replication then we might want to document them as caveats at
> page:
> https://www.postgresql.org/docs/devel/logicaldecoding-synchronous.html
> and then also give the reference for these caveats at prepared transactions
> page: https://www.postgresql.org/docs/devel/logicaldecoding-two-phase-commits.html
> 
> What do you think?
I've checked the behavior of the CLUSTER command in synchronous mode as
well, which is one of the examples above.

IIUC, you meant pg_class, and
the deadlock happens when I run cluster commands on pg_class using its index in 
synchronous mode.
The command I used is "BEGIN; CLUSTER pg_class USING pg_class_oid_index; END;".
This deadlock involves two processes: the backend waiting for
synchronization with the standby, and the walsender process which wants
to take a lock on pg_class.  Therefore, I think we need to do something,
at least a documentation fix, as you mentioned.
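
A sketch of the setup described above, assuming the publisher lists the
logical subscriber in synchronous_standby_names (the name is made up):

```
-- publisher, postgresql.conf:
--   synchronous_standby_names = 'mysub'
BEGIN;
CLUSTER pg_class USING pg_class_oid_index;
END;
-- The commit waits for the synchronous subscriber while the exclusive
-- lock on pg_class is still held, and the walsender decoding this
-- commit wants a lock on pg_class, so neither side can proceed.
```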

From the perspective of restarting: when I restart the locked publisher
in fast or immediate mode, in both cases the publisher restarts
successfully and accepts interactive psql connections.  So, after the
restart, we are released from the lock.


Best Regards,
Takamichi Osumi



Re: Skipping logical replication transactions on subscriber side

2021-05-25 Thread Masahiko Sawada
On Tue, May 25, 2021 at 2:49 PM Bharath Rupireddy
 wrote:
>
> On Mon, May 24, 2021 at 1:32 PM Masahiko Sawada  wrote:
> >
> > Hi all,
> >
> > If a logical replication worker cannot apply the change on the
> > subscriber for some reason (e.g., missing table or violating a
> > constraint, etc.), logical replication stops until the problem is
> > resolved. Ideally, we resolve the problem on the subscriber (e.g., by
> > creating the missing table or removing the conflicting data, etc.) but
> > occasionally a problem cannot be fixed and it may be necessary to skip
> > the entire transaction in question. Currently, we have two ways to
> > skip transactions: advancing the LSN of the replication origin on the
> > subscriber and advancing the LSN of the replication slot on the
> > publisher. But both ways might not be able to skip exactly one
> > transaction in question and end up skipping other transactions too.
>
> Does it mean pg_replication_origin_advance() can't skip exactly one
> txn? I'm not familiar with the function or never used it though, I was
> just searching for "how to skip a single txn in postgres" and ended up
> in [1]. Could you please give some more details on scenarios when we
> can't skip exactly one txn? Is there any other way to advance the LSN,
> something like directly updating the pg_replication_slots catalog?

Sorry, it's not impossible.  Although the user may mistakenly skip more
than one transaction by specifying a wrong LSN, it's always possible to
skip exactly one transaction.
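
For context, the existing way of skipping looks roughly like this (the
origin name and LSN are made-up examples; a subscription's origin is
typically named pg_<subscription oid>):

```
-- on the subscriber, with the subscription disabled
SELECT roname FROM pg_replication_origin;
SELECT pg_replication_origin_advance('pg_16389', '0/1D0F5C8');
-- Picking the wrong LSN here can silently skip more than the one
-- failing transaction, which is the risk being discussed.
```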

>
> [1] - https://www.postgresql.org/docs/devel/logical-replication-conflicts.html
>
> > I’d like to propose a way to skip the particular transaction on the
> > subscriber side. As the first step, a transaction can be specified to
> > be skipped by specifying remote XID on the subscriber. This feature
> > would need two sub-features: (1) a sub-feature for users to identify
> > the problem subscription and the problem transaction’s XID, and (2) a
> > sub-feature to skip the particular transaction to apply.
> >
> > For (1), I think the simplest way would be to put the details of the
> > change being applied in errcontext. For example, the following
> > errcontext shows the remote XID as well as the action name, the
> > relation name, and commit timestamp:
> >
> > ERROR:  duplicate key value violates unique constraint "test_pkey"
> > DETAIL:  Key (c)=(1) already exists.
> > CONTEXT:  during apply of "INSERT" for relation "public.test" in
> > transaction with xid 590 commit timestamp 2021-05-21
> > 14:32:02.134273+09
> >
> > The user can identify which remote XID has a problem during applying
> > the change (XID=590 in this case). As another idea, we can have a
> > statistics view for logical replication workers, showing information
> > of the last failure transaction.
>
> Agree with Amit on this. At times, it is difficult to look around in
> the server logs, so it will be better to have it in both places.
>
> > For (2), what I'm thinking is to add a new action to ALTER
> > SUBSCRIPTION command like ALTER SUBSCRIPTION test_sub SET SKIP
> > TRANSACTION 590. Also, we can have actions to reset it; ALTER
> > SUBSCRIPTION test_sub RESET SKIP TRANSACTION. Those commands add the
> > XID to a new column of pg_subscription or a new catalog, having the
> > worker reread its subscription information. Once the worker skipped
> > the specified transaction, it resets the transaction to skip on the
> > catalog. The syntax allows users to specify one remote XID to skip. In
> > the future, it might be good if users can also specify multiple XIDs
> > (a range of XIDs or a list of XIDs, etc).
>
> What's it like skipping a txn with txn id? Is it that the particular
> txn is forced to commit or abort or just skipping some of the code in
> the apply worker?

What I'm thinking is to ignore the entire transaction with the
specified XID. IOW Logical replication workers don't even start the
transaction and ignore all changes associated with the XID.

>  IIUC, the behavior of RESET SKIP TRANSACTION is just
> to forget the txn id specified in SET SKIP TRANSACTION right?

Right. I proposed this RESET command for users to cancel the skipping behavior.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: Assertion failure while streaming toasted data

2021-05-25 Thread Pavan Deolasee
On Tue, May 25, 2021 at 1:26 PM Dilip Kumar  wrote:

>
>
> I have identified the cause of the issue, basically, the reason is if
> we are doing a multi insert operation we don't set the toast cleanup
> until we get the last tuple of the xl_multi_insert [1].  Now, with
> streaming, we can process the transaction in between the multi-insert
> but while doing that the "change->data.tp.clear_toast_afterwards" is
> set to false in all the tuples in this stream. And due to this we will
> not clean up the toast.
>

Thanks. That matches my understanding too.


>
> One simple fix could be that we can just clean the toast memory if we
> are processing in the streaming mode (as shown in the attached patch).
>

I am not entirely sure if it works correctly. I'd tried something
similar, but the downstream node using my output plugin gets NULL values
for the toast columns.  It's a bit hard to demonstrate that with the
test_decoding plugin, but if you have some other mechanism to test that
change with an actual downstream node receiving and applying changes, it
will be useful to test with that.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Assertion failure while streaming toasted data

2021-05-25 Thread Dilip Kumar
On Tue, May 25, 2021 at 1:45 PM Pavan Deolasee  wrote:
>
> I am not entirely sure if it works correctly. I'd tried something similar, 
> but the downstream node using
> my output plugin gets NULL values for the toast columns.  It's a bit hard to 
> demonstrate that with the
> test_decoding plugin, but if you have some other mechanism to test that 
> change with an actual downstream
> node receiving and applying changes, it will be useful to test with that.

Okay, I will test that.  Thanks.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




RE: Parallel Inserts in CREATE TABLE AS

2021-05-25 Thread tsunakawa.ta...@fujitsu.com
From: houzj.f...@fujitsu.com 
> + /*
> +  * We don't need to skip contacting FSM while inserting tuples
> for
> +  * parallel mode, while extending the relations, workers
> instead of
> +  * blocking on a page while another worker is inserting, can
> check the
> +  * FSM for another page that can accommodate the tuples.
> This results
> +  * in major benefit for parallel inserts.
> +  */
> + myState->ti_options = 0;
> 
> I am not quite sure that disabling the " SKIP FSM " in parallel worker will 
> bring
> performance gain.
> In my test environment, if I change this code to use option "
> TABLE_INSERT_SKIP_FSM ", then there
> seems no performance degradation.

+1, probably.

Does the code comment represent the situation like this?

1. Worker 1 is inserting into page 1.

2. Worker 2 tries to insert into page 1, but cannot acquire the buffer content 
lock of page 1 because worker 1 holds it.

3. Worker 2 looks up FSM to find a page with enough free space.

But isn't FSM still empty during CTAS?



Regards
Takayuki Tsunakawa



Re: Assertion failure while streaming toasted data

2021-05-25 Thread Pavan Deolasee
On Tue, May 25, 2021 at 1:49 PM Dilip Kumar  wrote:

> On Tue, May 25, 2021 at 1:45 PM Pavan Deolasee 
> wrote:
> >
> > I am not entirely sure if it works correctly. I'd tried something
> similar, but the downstream node using
> > my output plugin gets NULL values for the toast columns.  It's a bit
> hard to demonstrate that with the
> > test_decoding plugin, but if you have some other mechanism to test that
> change with an actual downstream
> > node receiving and applying changes, it will be useful to test with that.
>
> Okay, I will test that.  Thanks.
>
>
I modified test_decoding to print the tuples (like in the non-streamed
case) and included your proposed fix. PFA

When the transaction is streamed, I see:
```
+ opening a streamed block for transaction
+ table public.toasted: INSERT: id[integer]:9001 other[text]:'bbb'
data[text]:'ccc'
+ table public.toasted: INSERT: id[integer]:9002 other[text]:'ddd'
data[text]:'eee'
+ table public.toasted: INSERT: id[integer]:9003 other[text]:'bar'
data[text]:unchanged-toast-datum

```

For a non-streamed case, the `data[text]` column shows the actual data.
That probably manifests as NULL data when the downstream handles it.

Thanks,
Pavan


-- 
Pavan Deolasee
EnterpriseDB: https://www.enterprisedb.com


stream_toasted_txn_print_tuples.patch
Description: Binary data


sync request forward function ForwardSyncRequest() might hang for some time in a corner case?

2021-05-25 Thread Paul Guo
Hi hackers,

I found this when reading the related code. Here is the scenario:

bool
RegisterSyncRequest(const FileTag *ftag, SyncRequestType type,
bool retryOnError)

For the case where retryOnError is true, the function calls
ForwardSyncRequest() in a loop until it succeeds.  But if
ForwardSyncRequest() runs into the branch below, RegisterSyncRequest()
will have to loop until the checkpointer absorbs the existing requests,
so it might hang for some time until a checkpoint request is triggered.
This scenario seems possible in theory, though the chance is not high.

ForwardSyncRequest():

if (CheckpointerShmem->checkpointer_pid == 0 ||
(CheckpointerShmem->num_requests >= CheckpointerShmem->max_requests &&
 !CompactCheckpointerRequestQueue()))
{
/*
 * Count the subset of writes where backends have to do their own
 * fsync
 */
if (!AmBackgroundWriterProcess())
CheckpointerShmem->num_backend_fsync++;
LWLockRelease(CheckpointerCommLock);
return false;
}

One fix is to add code like the below in RegisterSyncRequest() to
trigger the checkpointer for this scenario.

/* checkpointer_triggered: variable for one trigger only */
if (!ret && retryOnError && ProcGlobal->checkpointerLatch &&
!checkpointer_triggered)
SetLatch(ProcGlobal->checkpointerLatch);

Any comments?

Regards,
Paul Guo (Vmware)




Re: pg_rewind fails if there is a read only file.

2021-05-25 Thread Paul Guo
> You seem to have missed my point completely. The answer to this problem
> IMNSHO is "Don't put a read-only file in the data directory".

Oh, sorry.  Well, if we really do not want this, we may want to document
it and keep educating users, but meanwhile the product should probably
be more user-friendly for this case, especially considering that the fix
would be trivial and I suspect it is inevitable that some extensions put
read-only files (e.g. credential files) in pgdata.




Re: Assertion failure while streaming toasted data

2021-05-25 Thread Dilip Kumar
On Tue, May 25, 2021 at 1:59 PM Pavan Deolasee  wrote:

>
> On Tue, May 25, 2021 at 1:49 PM Dilip Kumar  wrote:
>>
>> On Tue, May 25, 2021 at 1:45 PM Pavan Deolasee  
>> wrote:
>> >
>> > I am not entirely sure if it works correctly. I'd tried something similar, 
>> > but the downstream node using
>> > my output plugin gets NULL values for the toast columns.  It's a bit hard 
>> > to demonstrate that with the
>> > test_decoding plugin, but if you have some other mechanism to test that 
>> > change with an actual downstream
>> > node receiving and applying changes, it will be useful to test with that.
>>
>> Okay, I will test that.  Thanks.
>>
>
> I modified test_decoding to print the tuples (like in the non-streamed case) 
> and included your proposed fix. PFA
>
> When the transaction is streamed, I see:
> ```
> + opening a streamed block for transaction
> + table public.toasted: INSERT: id[integer]:9001 other[text]:'bbb' 
> data[text]:'ccc'
> + table public.toasted: INSERT: id[integer]:9002 other[text]:'ddd' 
> data[text]:'eee'
> + table public.toasted: INSERT: id[integer]:9003 other[text]:'bar' 
> data[text]:unchanged-toast-datum
> 
> ```
>
> For a non-streamed case, the `data[text]` column shows the actual data. That 
> probably manifests into NULL data when downstream handles it.

Yes, I am able to reproduce this.  Basically, until we get the last
tuple of the multi-insert we cannot clear the toast data, otherwise we
can never form a complete tuple.  So the only possible fix I can think
of is to treat multi-insert WAL without the final multi-insert tuple as
partial data; then we will avoid streaming until we get the complete WAL
of one multi-insert.  I am working on a patch to fix this and will share
it in some time.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: libpq_pipeline in tmp_install

2021-05-25 Thread Peter Eisentraut

On 19.05.21 19:35, Tom Lane wrote:

Alvaro Herrera  writes:

On 2021-May-19, Tom Lane wrote:

+1, except that you should add documentation for NO_INSTALL to the
list of definable symbols at the head of pgxs.mk, and to the list
in extend.sgml (compare that for NO_INSTALLCHECK).



I propose this.


WFM.


Thanks for the feedback.  I found that my proposal doesn't quite work, 
because "check" doesn't depend on "all" (anymore; see dbf2ec1a1c0), so 
running make check-world doesn't build the test program first.  The 
easiest workaround I found was to add an "install: all" line even for 
the NO_INSTALL case.  It's all a bit hackish, though.
From 24946ef3c8118b88b8df04d6bab051d0b3b20420 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Tue, 25 May 2021 11:01:58 +0200
Subject: [PATCH v2] Add NO_INSTALL option to pgxs

Apply in libpq_pipeline test makefile, so that the test file is not
installed into tmp_install.

Reviewed-by: Alvaro Herrera 
Reviewed-by: Tom Lane 
Discussion: 
https://www.postgresql.org/message-id/flat/cb9d16a6-760f-cd44-28d6-b091d5fb6ca7%40enterprisedb.com
---
 doc/src/sgml/extend.sgml | 10 ++
 src/makefiles/pgxs.mk| 13 +
 src/test/modules/libpq_pipeline/Makefile |  2 ++
 3 files changed, 25 insertions(+)

diff --git a/doc/src/sgml/extend.sgml b/doc/src/sgml/extend.sgml
index ec95b4eb01..dae5737574 100644
--- a/doc/src/sgml/extend.sgml
+++ b/doc/src/sgml/extend.sgml
@@ -1672,6 +1672,16 @@ Extension Building Infrastructure
   
  
 
+ 
+  NO_INSTALL
+  
+   
+don't define an install target, useful for test
+modules that don't need their build products to be installed
+   
+  
+ 
+
  
   NO_INSTALLCHECK
   
diff --git a/src/makefiles/pgxs.mk b/src/makefiles/pgxs.mk
index 271e7eaba8..0f71fa293d 100644
--- a/src/makefiles/pgxs.mk
+++ b/src/makefiles/pgxs.mk
@@ -49,6 +49,8 @@
 #   TAP_TESTS -- switch to enable TAP tests
 #   ISOLATION -- list of isolation test cases
 #   ISOLATION_OPTS -- additional switches to pass to pg_isolation_regress
+#   NO_INSTALL -- don't define an install target, useful for test modules
+# that don't need their build products to be installed
 #   NO_INSTALLCHECK -- don't define an installcheck target, useful e.g. if
 # tests require special configuration, or don't use pg_regress
 #   EXTRA_CLEAN -- extra files to remove in 'make clean'
@@ -227,6 +229,8 @@ all: all-lib
 endif # MODULE_big
 
 
+ifndef NO_INSTALL
+
 install: all installdirs
 ifneq (,$(EXTENSION))
	$(INSTALL_DATA) $(addprefix $(srcdir)/, $(addsuffix .control, $(EXTENSION))) '$(DESTDIR)$(datadir)/extension/'
@@ -336,6 +340,15 @@ endif # with_llvm
 uninstall: uninstall-lib
 endif # MODULE_big
 
+else # NO_INSTALL
+
+# Need this so that temp-install builds artifacts not meant for
+# installation (Normally, check should depend on all, but we don't do
+# that because of parallel make risk (dbf2ec1a1c0).)
+install: all
+
+endif # NO_INSTALL
+
 
 clean:
 ifdef MODULES
diff --git a/src/test/modules/libpq_pipeline/Makefile 
b/src/test/modules/libpq_pipeline/Makefile
index b798f5fbbc..c9c5ae1beb 100644
--- a/src/test/modules/libpq_pipeline/Makefile
+++ b/src/test/modules/libpq_pipeline/Makefile
@@ -3,6 +3,8 @@
 PROGRAM = libpq_pipeline
 OBJS = libpq_pipeline.o
 
+NO_INSTALL = 1
+
 PG_CPPFLAGS = -I$(libpq_srcdir)
 PG_LIBS_INTERNAL += $(libpq_pgport)
 
-- 
2.31.1



RE: Added schema level support for publication.

2021-05-25 Thread tanghy.f...@fujitsu.com
On Monday, May 24, 2021 at 8:31 PM vignesh C  wrote:
> The earlier patch does not apply on the head. The v4 patch attached
> has the following changes:
> a) Rebased it on head. b) Removed pubschemas, pubtables columns and
> replaced it with pubtype in pg_publication table. c) List the schemas
> in describe publication. d) List the publication in list schemas. e)
> Add support for "FOR SCHEMA CURRENT_SCHEMA". f) Tab completion for
> "FOR SCHEMA" in create publication and alter publication. g) Included
> the newly added structure type to typedefs.lst

Thanks for your patch.

I ran "make check-world" after applying your patch but it failed on my machine. 
I saw the following log:
--
parallel group (2 tests):  subscription publication
 publication  ... FAILED  108 ms
 subscription ... ok   87 ms


diff -U3 /home/fnst/data/postgresql_schema/postgresql/src/test/regress/expected/publication.out /home/fnst/data/postgresql_schema/postgresql/src/test/regress/results/publication.out
--- /home/fnst/data/postgresql_schema/postgresql/src/test/regress/expected/publication.out	2021-05-25 15:44:52.261683712 +0800
+++ /home/fnst/data/postgresql_schema/postgresql/src/test/regress/results/publication.out	2021-05-25 15:48:41.393672595 +0800
@@ -359,10 +359,10 @@
 "public"

 \dn public;
- List of schemas
-  Name  |  Owner
-+-
- public | vignesh
+List of schemas
+  Name  | Owner
++---
+ public | fnst
 Publications:
 "testpub3_forschema"
--

I think the owner of CURRENT_SCHEMA should not be written into
publication.out, because the result depends on the user running the
test.  Maybe we can use "ALTER SCHEMA public OWNER TO ..." to set a
fixed owner before this test case. Thoughts?
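
A sketch of that suggestion, assuming the role created at the top of the
publication test (regress_publication_user) is the one we want to pin
the output to:

```
ALTER SCHEMA public OWNER TO regress_publication_user;
\dn public
```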

Regards
Tang


Re: Fdw batch insert error out when set batch_size > 65535

2021-05-25 Thread Bharath Rupireddy
On Tue, May 25, 2021 at 1:08 PM houzj.f...@fujitsu.com
 wrote:
> Thanks for the comments. I have addressed all comments to the v3 patch.

Thanks! The patch basically looks good to me except that it is missing
a commit message. I think it can be added now.

> BTW, Is the batch_size issue here an Open Item of PG14 ?

IMO, the issue you found when setting batch_size to a very high value
is an extreme-case test of the feature added by commit b663a4136, which
introduced the batch_size parameter. So, it's a bug to me. I think you
can add it as a bug in the commitfest and let the committers take the
call.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: Assertion failure while streaming toasted data

2021-05-25 Thread Dilip Kumar
On Tue, May 25, 2021 at 2:34 PM Dilip Kumar  wrote:
>
> > When the transaction is streamed, I see:
> > ```
> > + opening a streamed block for transaction
> > + table public.toasted: INSERT: id[integer]:9001 other[text]:'bbb' 
> > data[text]:'ccc'
> > + table public.toasted: INSERT: id[integer]:9002 other[text]:'ddd' 
> > data[text]:'eee'
> > + table public.toasted: INSERT: id[integer]:9003 other[text]:'bar' 
> > data[text]:unchanged-toast-datum
> > 
> > ```
> >
> > For a non-streamed case, the `data[text]` column shows the actual data. 
> > That probably manifests into NULL data when downstream handles it.
>
> Yes, I am able to reproduce this, basically, until we get the last
> tuple of the multi insert we can not clear the toast data otherwise we
> can never form a complete tuple.  So the only possible fix I can think
> of is to consider the multi-insert WAL without the final multi-insert
> tuple as partial data then we will avoid streaming until we get the
> complete WAL of one multi-insert.  I am working on the patch to fix
> this, I will share that in some time.

The attached patch should fix the issue; now the output is like below:

===
opening a streamed block for transaction
table public.toasted: INSERT: id[integer]:9001 other[text]:'bbb'
data[text]:'ccc'
 table public.toasted: INSERT: id[integer]:9002 other[text]:'ddd'
data[text]:'eee'
 table public.toasted: INSERT: id[integer]:9003 other[text]:'bar'
data[text]:'

===

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From fc58da41802f966dd269679921a5daa33d21a291 Mon Sep 17 00:00:00 2001
From: Dilip Kumar 
Date: Tue, 25 May 2021 14:51:45 +0530
Subject: [PATCH v1] Fix bug while streaming the multi-insert toast changes

While processing a multi-insert we cannot clean the toast until
we get the last insert of the multi-insert.  So mark the transaction
incomplete for streaming if we get any multi-insert change, and keep
it set until we get the last tuple of the multi-insert.
---
 src/backend/replication/logical/reorderbuffer.c | 11 +++
 src/include/replication/reorderbuffer.h | 11 ++-
 2 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index b0ab91c..b715992 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -732,6 +732,17 @@ ReorderBufferProcessPartialChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	}
 
 	/*
+	 * Set the multi-insert bit whenever we see clear_toast_afterwards set
+	 * to false; otherwise reset it if it is currently set.
+	 */
+	if (!change->data.tp.clear_toast_afterwards)
+	{
+		toptxn->txn_flags |= RBTXN_HAS_MULTI_INSERT;
+	}
+	else if (rbtxn_has_multi_insert(toptxn))
+		toptxn->txn_flags &= ~RBTXN_HAS_MULTI_INSERT;
+
+	/*
 	 * Stream the transaction if it is serialized before and the changes are
 	 * now complete in the top-level transaction.
 	 *
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 53cdfa5..cd36a8d 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -176,6 +176,7 @@ typedef struct ReorderBufferChange
 #define RBTXN_HAS_SPEC_INSERT 0x0040
 #define RBTXN_PREPARE 0x0080
 #define RBTXN_SKIPPED_PREPARE	  0x0100
+#define RBTXN_HAS_MULTI_INSERT0x0200
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -214,11 +215,19 @@ typedef struct ReorderBufferChange
 ( \
 	((txn)->txn_flags & RBTXN_HAS_SPEC_INSERT) != 0 \
 )
+/*
+ * This transaction's changes include a multi-insert, without the last
+ * multi-insert tuple.
+ */
+#define rbtxn_has_multi_insert(txn) \
+( \
+	((txn)->txn_flags & RBTXN_HAS_MULTI_INSERT) != 0 \
+)
 
 /* Check whether this transaction has an incomplete change. */
 #define rbtxn_has_incomplete_tuple(txn) \
 ( \
-	rbtxn_has_toast_insert(txn) || rbtxn_has_spec_insert(txn) \
+	rbtxn_has_toast_insert(txn) || rbtxn_has_spec_insert(txn) || rbtxn_has_multi_insert(txn) \
 )
 
 /*
-- 
1.8.3.1



Replace run-time error check with assertion

2021-05-25 Thread Peter Eisentraut


In the attached patch, the error message was checking that the 
structures returned from the parser matched expectations.  That's 
something we usually use assertions for, not a full user-facing error 
message.  So I replaced that with an assertion (hidden inside 
lfirst_node()).
From a30697758c475db2f4333cedbe6aad1e1acd0ee8 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Tue, 25 May 2021 11:27:04 +0200
Subject: [PATCH] Replace run-time error check with assertion

The error message was checking that the structures returned from the
parser matched expectations.  That's something we usually use
assertions for, not a full user-facing error message.
---
 src/backend/commands/statscmds.c | 21 +
 1 file changed, 5 insertions(+), 16 deletions(-)

diff --git a/src/backend/commands/statscmds.c b/src/backend/commands/statscmds.c
index acae19176c..b244a0fbd7 100644
--- a/src/backend/commands/statscmds.c
+++ b/src/backend/commands/statscmds.c
@@ -220,26 +220,14 @@ CreateStatistics(CreateStatsStmt *stmt)
 */
foreach(cell, stmt->exprs)
{
-   Node   *expr = (Node *) lfirst(cell);
-   StatsElem  *selem;
-   HeapTuple   atttuple;
-   Form_pg_attribute attForm;
-   TypeCacheEntry *type;
-
-   /*
-* We should not get anything else than StatsElem, given the grammar.
-* But let's keep it as a safety.
-*/
-   if (!IsA(expr, StatsElem))
-   ereport(ERROR,
-   (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-errmsg("only simple column references 
and expressions are allowed in CREATE STATISTICS")));
-
-   selem = (StatsElem *) expr;
+   StatsElem  *selem = lfirst_node(StatsElem, cell);
 
if (selem->name)/* column reference */
{
char   *attname;
+   HeapTuple   atttuple;
+   Form_pg_attribute attForm;
+   TypeCacheEntry *type;
 
attname = selem->name;
 
@@ -273,6 +261,7 @@ CreateStatistics(CreateStatsStmt *stmt)
{
Node   *expr = selem->expr;
Oid atttype;
+   TypeCacheEntry *type;
 
Assert(expr != NULL);
 
-- 
2.31.1



Re: parallel vacuum - few questions on docs, comments and code

2021-05-25 Thread Amit Kapila
On Fri, May 21, 2021 at 3:33 PM Amit Kapila  wrote:
>
> On Fri, May 21, 2021 at 1:30 PM Bharath Rupireddy
>  wrote:
> >
> > On Fri, May 21, 2021 at 11:10 AM Amit Kapila  
> > wrote:
> > >
> > > If the patch changes the vacuumdb code as above then isn't it better
> > > to change the vacuumdb docs to reflect the same. See below part of
> > > vacuumdb docs:
> > > -P parallel_degree
> > > --parallel=parallel_degree
> >
> > Changed.
> >
> > > Also, can you please check if your patch works for PG-13 as well
> > > because I think it is better to backpatch it?
> >
> > I'm not sure about backpatching as it is not a critical bug fix. Since
> > the changes are user visible, I think that it's okay to backpatch.
> >
>
> Yes, as it is a user-visible change (though minor) so I thought it
> would be good to backpatch this. Does anyone else have any opinion on
> this?
>

Pushed!

-- 
With Regards,
Amit Kapila.




Re: Replication slot stats misgivings

2021-05-25 Thread Amit Kapila
On Mon, May 24, 2021 at 10:09 AM vignesh C  wrote:
>
> On Mon, May 24, 2021 at 9:38 AM Amit Kapila  wrote:
> >
> > On Thu, May 13, 2021 at 11:30 AM vignesh C  wrote:
> > >
> >
> > Do we want to update the information about pg_stat_replication_slots
> > at the following place in docs
> > https://www.postgresql.org/docs/devel/logicaldecoding-catalogs.html?
> >
> > If so, feel free to submit the patch for it?
>
> Adding it will be useful, the attached patch has the changes for the same.
>

Thanks for the patch, pushed.

-- 
With Regards,
Amit Kapila.




Re: [PATCH] Add `truncate` option to subscription commands

2021-05-25 Thread Amit Kapila
On Mon, May 24, 2021 at 2:10 PM Bharath Rupireddy
 wrote:
>
> On Mon, May 24, 2021 at 11:01 AM Amit Kapila  wrote:
> > > 1) Whether a table the sync worker is trying to truncate is having any
> > > referencing (foreign key) tables on the subscriber? If yes, whether
> > > all the referencing tables are present in the list of subscription
> > > tables (output of fetch_table_list)? In this case, the sync worker is
> > > truncating the primary key/referenced table.
> > >
> > > One way to solve the above problem is by storing the table oids of the
> > > subscription tables (output of fetch_table_list) in a new column in
> > > the pg_subscription catalog (like subpublications text[] column). In
> > > the sync worker, before truncation of a table, use
> > > heap_truncate_find_FKs to get all the referencing tables of the given
> > > table and get all the subscription tables from the new pg_subscription
> > > column. If all the referencing tables exist in the subscription
> > > tables, then truncate the table, otherwise don't, just skip it.
> > >
> >
> > Here, silently skipping doesn't seem like a good idea when the user
> > has asked to truncate the table. Shouldn't we allow it if the user has
> > provided say cascade with a truncate option?
>
> We could do that. In that case, the truncate option just can't be a
> boolean, but it has to be an enum accepting "restrict", "cascade",
> maybe "restart identity" or "continue identity" too. I have a concern
> here - what if the ExecuteTruncateGuts fails with the cascade option
> for whatever reason? Should the table sync worker be trapped in that
> error?
>

How is it any different from any other error we got during table sync
(say PK violation, out of memory, or any other such error)?

>
> > > There
> > > can be a problem here if there are many subscription tables, the size
> > > of the new column in pg_susbcription can be huge. However, we can
> > > choose to store the table ids in this new column only when the
> > > truncate option is specified.
> > >
> > > Another way is to let each table sync worker scan the
> > > pg_subscription_rel to get all the relations that belong to a
> > > subscription. But I felt this was costly.
> > >
> >
> > I feel it is better to use pg_subscription_rel especially because we
> > will do so when the user has given the truncate option and note that
> > we are already accessing it in sync worker for both reading and
> > writing. See LogicalRepSyncTableStart.
>
> Note that in pg_subscription_rel, there can exist multiple rows for
> each table for a given subscription. Say, t1 is a table that the sync
> worker is trying to truncate and copy. Say, t1_dep1, t1_dep2, t1_dep3
>  are the dependent tables (we can find these using
> heap_truncate_find_FKs). Now, we need to see if all the t1_dep1,
> t1_dep2, t1_dep3  tables are in the pg_suscription_rel with the
> same subscription id, then only we can delete all of them with
> EexecuteTruncateGuts() using cascade option. If any of the t1_depX is
> either not in the pg_subscription_rel or it is subscribed in another
> subscription, then is it okay if we scan pg_suscription_rel in a loop
> with t1_depX relid's?
>

Why do you need to search in a loop? There is an index for relid, subid.

> Isn't it costlier? Or since it is a cache
> lookup, maybe that's okay?
>
> /* Try finding the mapping. */
> tup = SearchSysCache2(SUBSCRIPTIONRELMAP,
>   ObjectIdGetDatum(relid),
>   ObjectIdGetDatum(subid));
>
> > One other problem discussed in this thread was what to do when the
> > same table is part of multiple subscriptions and the user has provided
> > a truncate option while operating on such a subscription.
>
> I think we can just skip truncating a table when it is part of
> multiple subscriptions. We can tell this clearly in the documentation.
>

Okay, I don't have any better ideas at this stage.

-- 
With Regards,
Amit Kapila.




Re: Logical Replication - behavior of TRUNCATE ... CASCADE

2021-05-25 Thread Amit Kapila
On Mon, May 24, 2021 at 2:18 PM Bharath Rupireddy
 wrote:
>
> On Mon, May 24, 2021 at 11:22 AM Amit Kapila  wrote:
> > I don't deny that this can allow some additional cases than we allow
> > today but was just not sure whether users really need it. If we want
> > to go with such an option then as mentioned earlier, we should
> > consider another proposal for subscriber-side truncate [1] because we
> > might need a cascade operation there as well but for a slightly
> > different purpose.
>
> I'm thinking how we can utilize the truncate option proposed at [1]
> for the idea here. Because, currently the truncate option(proposed at
> [1]) is boolean, (of course we can change this to take "cascade",
> "restrict" options). But how can we differentiate the usage of the
> truncate option at [1] for two purposes 1) for before copy
> data/initial table sync operation and 2) for the replication of
> TRUNCATE command as proposed here in this thread. Any thoughts?
>

I think we can do this as a separate patch. Let's not try to combine
both patches.

-- 
With Regards,
Amit Kapila.




Re: Assertion failure while streaming toasted data

2021-05-25 Thread Pavan Deolasee
On Tue, May 25, 2021 at 2:57 PM Dilip Kumar  wrote:

>
> >
> > Yes, I am able to reproduce this, basically, until we get the last
> > tuple of the multi insert we can not clear the toast data otherwise we
> > can never form a complete tuple.  So the only possible fix I can think
> > of is to consider the multi-insert WAL without the final multi-insert
> > tuple as partial data then we will avoid streaming until we get the
> > complete WAL of one multi-insert.  I am working on the patch to fix
> > this, I will share that in some time.
>
> The attached patch should fix the issue, now the output is like below
>
>
Thanks. This looks fine to me. We should still be able to stream
multi-insert transactions (COPY) as and when the copy buffer becomes full
and is flushed. That seems to be a reasonable restriction to me.

We should incorporate the regression test in the final patch. I am not
entirely sure if what I have done is acceptable (or even works in
all scenarios). We could possibly have a long list of tuples instead of
doing the exponential magic. Or we should consider lowering the min value
for logical_decoding_work_mem and run these tests with a much lower value.
In fact, that's how I caught the problem in the first place. I had
deliberately lowered the value to 1kB so that streaming code kicks in very
often and even for small transactions.

Thanks,
Pavan

-- 
Pavan Deolasee
EnterpriseDB: https://www.enterprisedb.com


Re: Assertion failure while streaming toasted data

2021-05-25 Thread Dilip Kumar
On Tue, May 25, 2021 at 3:33 PM Pavan Deolasee  wrote:

>> The attached patch should fix the issue, now the output is like below
>>
>
> Thanks. This looks fine to me. We should still be able to stream multi-insert 
> transactions (COPY) as and when the copy buffer becomes full and is flushed. 
> That seems to be a reasonable restriction to me.
>
> We should incorporate the regression test in the final patch. I am not 
> entirely sure if what I have done is acceptable (or even works in all 
> scenarios). We could possibly have a long list of tuples instead of doing the 
> exponential magic. Or we should consider lowering the min value for 
> logical_decoding_work_mem and run these tests with a much lower value. In 
> fact, that's how I caught the problem in the first place. I had deliberately 
> lowered the value to 1kB so that streaming code kicks in very often and even 
> for small transactions.

Thanks for confirming, I will come up with the test and add that to
the next version of the patch.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Skipping logical replication transactions on subscriber side

2021-05-25 Thread Bharath Rupireddy
On Tue, May 25, 2021 at 1:44 PM Masahiko Sawada  wrote:
>
> On Tue, May 25, 2021 at 2:49 PM Bharath Rupireddy
>  wrote:
> >
> > On Mon, May 24, 2021 at 1:32 PM Masahiko Sawada  
> > wrote:
> > >
> > > Hi all,
> > >
> > > If a logical replication worker cannot apply the change on the
> > > subscriber for some reason (e.g., missing table or violating a
> > > constraint, etc.), logical replication stops until the problem is
> > > resolved. Ideally, we resolve the problem on the subscriber (e.g., by
> > > creating the missing table or removing the conflicting data, etc.) but
> > > occasionally a problem cannot be fixed and it may be necessary to skip
> > > the entire transaction in question. Currently, we have two ways to
> > > skip transactions: advancing the LSN of the replication origin on the
> > > subscriber and advancing the LSN of the replication slot on the
> > > publisher. But both ways might not be able to skip exactly one
> > > transaction in question and end up skipping other transactions too.
> >
> > Does it mean pg_replication_origin_advance() can't skip exactly one
> > txn? I'm not familiar with the function or never used it though, I was
> > just searching for "how to skip a single txn in postgres" and ended up
> > in [1]. Could you please give some more details on scenarios when we
> > can't skip exactly one txn? Is there any other way to advance the LSN,
> > something like directly updating the pg_replication_slots catalog?
>
> Sorry, it's not impossible. Although the user mistakenly skips more
> than one transaction by specifying a wrong LSN it's always possible to
> skip an exact one transaction.

IIUC, if the user specifies the "correct LSN", then it's possible to
skip exact txn for which the sync workers are unable to apply changes,
right?

How can the user get the LSN (which we call "correct LSN")? Is it from
pg_replication_slots? Or some other way?

If the user somehow can get the "correct LSN", can't the exact txn be
skipped using it with any of the existing ways, either using
pg_replication_origin_advance or any other ways?

If there's no way to get the "correct LSN", then why can't we just
print that LSN in the error context and/or in the new statistics view
for logical replication workers, so that any of the existing ways can
be used to skip exactly one txn?

IIUC, the feature proposed here guards against the users specifying
wrong LSN. If I'm right, what is the guarantee that users don't
specify the wrong txn id? Why can't we tell the users when a wrong LSN
is specified that "currently, an apply worker is failing to apply the
LSN , and you specified LSN , are you sure this is
intentional?"

Please correct me if I'm missing anything.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




How can the Aggregation move to the outer query

2021-05-25 Thread Andy Fan
My question can be demonstrated with the below example:

create table m1(a int, b int);
explain (costs off) select (select count(*) filter (where true) from m1 t1)
from m1 t2 where t2.b % 2 = 1;

   QUERY PLAN
-
 Seq Scan on m1 t2
   Filter: ((b % 2) = 1)
   InitPlan 1 (returns $0)
 ->  Aggregate
   ->  Seq Scan on m1 t1
(5 rows)

The above is good to me. The aggregate is run in the subPlan/InitPlan.

explain (costs off) select (select count(*) filter (where t2.b = 1) from m1 t1)
from m1 t2 where t2.b % 2 = 1;

  QUERY PLAN
---
 Aggregate
   ->  Seq Scan on m1 t2
 Filter: ((b % 2) = 1)
   SubPlan 1
 ->  Seq Scan on m1 t1
(5 rows)

This one is too confusing to me since the Aggregate happens
on t2 rather than t1.  What happens here? Would this query
generate 1 row all the time like SELECT aggfunc(a) FROM t?

-- 
Best Regards
Andy Fan (https://www.aliyun.com/)


Add ZSON extension to /contrib/

2021-05-25 Thread Aleksander Alekseev
Hi hackers,

Back in 2016 while being at PostgresPro I developed the ZSON extension [1].
The extension introduces the new ZSON type, which is 100% compatible with
JSONB but uses a shared dictionary of strings most frequently used in given
JSONB documents for compression. These strings are replaced with integer
IDs. Afterward, PGLZ (and now LZ4) compression applies if the document is
large enough, per the usual PostgreSQL logic. Under certain conditions
(many large documents), this saves disk space and memory and increases
overall performance. More details can be found in the README on GitHub.
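
A usage sketch, based on the README (the table and column names are
arbitrary, and assume a table "documents" with a JSONB column "payload"
already exists to learn the dictionary from):

```
CREATE EXTENSION zson;
-- build the shared dictionary from existing JSONB documents
SELECT zson_learn('{{"documents", "payload"}}');
-- then store documents in a zson column instead of jsonb
CREATE TABLE documents_z (payload zson);
INSERT INTO documents_z VALUES ('{"title": "hello", "tags": ["a", "b"]}');
```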

The extension was accepted warmly and I immediately got several requests
to submit it to /contrib/ so that people using Amazon RDS and similar
services could enjoy it too. Back then I was not sure whether the
extension was mature enough or whether it lacked additional features
required to solve the real-world problems of the users. Time showed,
however, that people are
happy with the extension as it is. There were several minor issues
discovered, but they were fixed back in 2017. The extension never
experienced any compatibility problems with the next major release of
PostgreSQL.

So my question is whether the community would consider adding ZSON to
/contrib/.  If so, I will add this thread to the nearest CF and submit a
corresponding patch.

[1]: https://github.com/postgrespro/zson

-- 
Best regards,
Aleksander Alekseev
Open-Source PostgreSQL Contributor at Timescale


Re: Adaptive Plan Sharing for PreparedStmt

2021-05-25 Thread Andy Fan
Hi,

On Thu, May 20, 2021 at 5:02 PM Tomas Vondra 
wrote:

> Hi,
>
> On 5/20/21 5:43 AM, Andy Fan wrote:
> > Currently we are using a custom/generic strategy to handle the data skew
> > issue. However, it doesn't work well all the time. For example:  SELECT *
> > FROM t WHERE a between $1 and $2. We assume the selectivity is 0.0025,
> > But users may provide a large range every time. Per our current strategy,
> > a generic plan will be chosen, Index scan on A will be chosen. oops..
> >
>
> Yeah, the current logic is rather simple, which is however somewhat on
> purpose, as it makes the planning very cheap. But it also means there's
> very little info to check/compare and so we may make mistakes.
>
> > I think Oracle's Adaptive Cursor sharing should work. First It calculate
> > the selectivity with the real bind values and generate/reuse different
> plan
> > based on the similarity of selectivity. The challenges I can think of
> > now are:
> > a). How to define the similarity.  b). How to adjust the similarity
> > during the
> > real run. for example, we say [1% ~ 10%] is similar. but we find
> > selectivity 20%
> > used the same plan as 10%. what should be done here.
> >
>
> IMO the big question is how expensive this would be. Calculating the
> selectivities for real values (i.e. for each query) is not expensive,
> but it's not free either. So even if we compare the selectivities in
> some way and skip the actual query planning, it's still going to impact
> the prepared statements.
>

That's true if we need to do this every time.  We may only need to do
this in cases where the estimation is likely to be wrong, like a > $1 or
a between $1 and $2.  In such cases, we currently just use a hard-coded
value.

> Also, we currently don't have any mechanism to extract the selectivities
> from the whole query - not sure how complex that would be, as it may
> involve e.g. join selectivities.
>
The idea in my mind is to check just the quals on base relations, like
t1.a > $1; something like t1.a + t2.a > $1 would be ignored.


>
> As for how to define the similarity, I doubt there's a simple and
> sensible/reliable way to do that :-(
>
> I remember reading a paper about query planning in which the parameter
> space was divided into regions with the same plan. In this case the
> parameters are selectivities for all the query operations. So what we
> might do is this:
>
> 1) Run the first N queries and extract the selectivities / plans.
>
> 2) Build "clusters" of selecitivies with the same plan.
>
> 3) Before running a query, see if it the selectivities fall into one of
> the existing clusters. If yes, use the plan. If not, do regular
> planning, add it to the data set and repeat (2).
>
> I have no idea how expensive would this be, and I assume the "clusters"
> may have fairly complicated shapes (not simple convex regions).
>
>
Thanks for sharing this; we do have lots of things to do here.  Your
idea should be a good place to start.

-- 
Best Regards
Andy Fan (https://www.aliyun.com/)


Re: Assertion failure while streaming toasted data

2021-05-25 Thread Amit Kapila
On Tue, May 25, 2021 at 2:57 PM Dilip Kumar  wrote:
>
> On Tue, May 25, 2021 at 2:34 PM Dilip Kumar  wrote:
> >
> > > When the transaction is streamed, I see:
> > > ```
> > > + opening a streamed block for transaction
> > > + table public.toasted: INSERT: id[integer]:9001 other[text]:'bbb' 
> > > data[text]:'ccc'
> > > + table public.toasted: INSERT: id[integer]:9002 other[text]:'ddd' 
> > > data[text]:'eee'
> > > + table public.toasted: INSERT: id[integer]:9003 other[text]:'bar' 
> > > data[text]:unchanged-toast-datum
> > > 
> > > ```
> > >
> > > For a non-streamed case, the `data[text]` column shows the actual data. 
> > > That probably manifests into NULL data when downstream handles it.
> >
> > Yes, I am able to reproduce this, basically, until we get the last
> > tuple of the multi insert we can not clear the toast data otherwise we
> > can never form a complete tuple.  So the only possible fix I can think
> > of is to consider the multi-insert WAL without the final multi-insert
> > tuple as partial data then we will avoid streaming until we get the
> > complete WAL of one multi-insert.

Yeah, that sounds reasonable.

> >  I am working on the patch to fix
> > this, I will share that in some time.
>
> The attached patch should fix the issue, now the output is like below
>

Your patch will fix the reported scenario but I don't like the way the
multi_insert flag is used to detect an incomplete tuple. One problem
could be that even when there are no toast inserts, it won't allow
streaming until we get the last tuple of the multi-insert WAL. How about
changing the code such that when we are clearing the toast flag, we
additionally check the 'clear_toast_afterwards' flag?

-- 
With Regards,
Amit Kapila.




Re: CALL versus procedures with output-only arguments

2021-05-25 Thread Peter Eisentraut

On 24.05.21 02:01, Tom Lane wrote:

I wrote:

I think we ought to fix this so that OUT-only arguments are ignored
when calling from SQL not plpgsql.


I'm working on a patch to make it act that way.  I've got some issues
yet to fix with named arguments (which seem rather undertested BTW,
since the patch is passing check-world even though I know it will
crash instantly on cases with CALL+named-args+out-only-args).

Before I spend too much time on it though, I wanted to mention that
it includes undoing 2453ea142's decision to include OUT arguments
in pg_proc.proargtypes for procedures (but not for any other kind of
routine).  I thought that was a terrible decision and I'm very happy
to revert it, but is anyone likely to complain loudly?


I don't understand why you want to change this.  The argument resolution 
of CALL is specified in the SQL standard; we shouldn't just make up our 
own system.





Re: Add ZSON extension to /contrib/

2021-05-25 Thread Magnus Hagander
On Tue, May 25, 2021 at 12:55 PM Aleksander Alekseev
 wrote:
>
> Hi hackers,
>
> Back in 2016 while being at PostgresPro I developed the ZSON extension [1]. 
> The extension introduces the new ZSON type, which is 100% compatible with 
> JSONB but uses a shared dictionary of strings most frequently used in given 
> JSONB documents for compression. These strings are replaced with integer IDs. 
> Afterward, PGLZ (and now LZ4) applies if the document is large enough by 
> common PostgreSQL logic. Under certain conditions (many large documents), 
> this saves disk space, memory and increases the overall performance. More 
> details can be found in README on GitHub.
>
> The extension was accepted warmly and instantaneously I got several requests 
> to submit it to /contrib/ so people using Amazon RDS and similar services 
> could enjoy it too. Back then I was not sure if the extension is mature 
> enough and if it lacks any additional features required to solve the 
> real-world problems of the users. Time showed, however, that people are happy 
> with the extension as it is. There were several minor issues discovered, but 
> they were fixed back in 2017. The extension never experienced any 
> compatibility problems with the next major release of PostgreSQL.
>
> So my question is whether the community would consider adding ZSON to /contrib/.
> If so, I will add this thread to the nearest CF and submit a 
> corresponding patch.

If the extension is mature enough, why make it an extension in
contrib, and not instead either enhance the existing jsonb type with
it or make it a built-in type?

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/




Re: Fixing the docs for ALTER SUBSCRIPTION ... ADD/DROP PUBLICATION

2021-05-25 Thread Peter Eisentraut

On 21.05.21 22:19, Mark Dilger wrote:

+ALTER SUBSCRIPTION mysubscription DROP PUBLICATION nosuchpub WITH (copy_data = 
false, refresh = false);
+ERROR:  unrecognized subscription parameter: "copy_data"
+ALTER SUBSCRIPTION mysubscription SET (copy_data = false, refresh = false);
+ERROR:  unrecognized subscription parameter: "copy_data"


Better wording might be something along the lines of "subscription 
parameter %s not supported in this context".  I'm not sure how easy this 
would be to implement, but with enough brute force it would surely be 
possible.





Re: How can the Aggregation move to the outer query

2021-05-25 Thread David Rowley
On Tue, 25 May 2021 at 22:28, Andy Fan  wrote:
>
> explain (costs off) select  (select count(*) filter (where t2.b = 1) from m1 
> t1)
> from m1 t2 where t2.b % 2 = 1;
>
>   QUERY PLAN
> ---
>  Aggregate
>->  Seq Scan on m1 t2
>  Filter: ((b % 2) = 1)
>SubPlan 1
>  ->  Seq Scan on m1 t1
> (5 rows)
>
> This one is too confusing to me since the Aggregate happens
> on t2 rather than t1.  What happens here? Would this query
> generate 1 row all the time like SELECT aggfunc(a) FROM t?

I think you're misreading the plan. There's a scan on t2 with a
subplan then an aggregate on top of that. Because you made the
subquery correlated by adding t2.b, it cannot be executed as an
initplan.

You might see what's going on better if you add VERBOSE to the EXPLAIN options.

David




Re: Possible pointer var TupleDesc rettupdesc used not initialized (src/backend/optimizer/util/clauses.c)

2021-05-25 Thread Ranier Vilela
Em seg., 24 de mai. de 2021 às 23:35, Zhihong Yu 
escreveu:

>
>
> On Mon, May 24, 2021 at 7:21 PM Ranier Vilela  wrote:
>
>> Em seg., 24 de mai. de 2021 às 22:42, Mark Dilger <
>> mark.dil...@enterprisedb.com> escreveu:
>>
>>>
>>>
>>> > On May 24, 2021, at 5:37 PM, Ranier Vilela 
>>> wrote:
>>> >
>>> > Hi,
>>> >
>>> > Possible pointer TupleDesc rettupdesc used not initialized?
>>> >
>>> > if (!isNull) at line 4346 taking a true branch, the function
>>> check_sql_fn_retval at line 4448 can use rettupdesc uninitialized.
>>>
>>> Care to submit a patch?
>>>
>> Hi Mark, sorry but not.
>> I examined the code and I can't say what the correct value is for
>> rettupdesc.
>>
>
> Hi,
> It seems the following call would fill up value for rettupdesc :
>
> functypclass = get_expr_result_type((Node *) fexpr, NULL, &rettupdesc);
>
In short, do you suggest running half of the else branch?
To do this, you need to fill in fexpr correctly.
That will not always be a trivial solution.

regards,
Ranier Vilela


Re: Assertion failure while streaming toasted data

2021-05-25 Thread Dilip Kumar
On Tue, May 25, 2021 at 4:50 PM Amit Kapila  wrote:
>
> Your patch will fix the reported scenario but I don't like the way
> multi_insert flag is used to detect incomplete tuple. One problem
> could be that even when there are no toast inserts, it won't allow to
> stream unless we get the last tuple of multi insert WAL. How about
> changing the code such that when we are clearing the toast flag, we
> additionally check 'clear_toast_afterwards' flag?

Yes, that can be done, I will fix this in the next version of the patch.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Skipping logical replication transactions on subscriber side

2021-05-25 Thread Masahiko Sawada
On Tue, May 25, 2021 at 7:21 PM Bharath Rupireddy
 wrote:
>
> On Tue, May 25, 2021 at 1:44 PM Masahiko Sawada  wrote:
> >
> > On Tue, May 25, 2021 at 2:49 PM Bharath Rupireddy
> >  wrote:
> > >
> > > On Mon, May 24, 2021 at 1:32 PM Masahiko Sawada  
> > > wrote:
> > > >
> > > > Hi all,
> > > >
> > > > If a logical replication worker cannot apply the change on the
> > > > subscriber for some reason (e.g., missing table or violating a
> > > > constraint, etc.), logical replication stops until the problem is
> > > > resolved. Ideally, we resolve the problem on the subscriber (e.g., by
> > > > creating the missing table or removing the conflicting data, etc.) but
> > > > occasionally a problem cannot be fixed and it may be necessary to skip
> > > > the entire transaction in question. Currently, we have two ways to
> > > > skip transactions: advancing the LSN of the replication origin on the
> > > > subscriber and advancing the LSN of the replication slot on the
> > > > publisher. But both ways might not be able to skip exactly one
> > > > transaction in question and end up skipping other transactions too.
> > >
> > > Does it mean pg_replication_origin_advance() can't skip exactly one
> > > txn? I'm not familiar with the function or never used it though, I was
> > > just searching for "how to skip a single txn in postgres" and ended up
> > > in [1]. Could you please give some more details on scenarios when we
> > > can't skip exactly one txn? Is there any other way to advance the LSN,
> > > something like directly updating the pg_replication_slots catalog?
> >
> > Sorry, it's not impossible. Although the user mistakenly skips more
> > than one transaction by specifying a wrong LSN it's always possible to
> > skip an exact one transaction.
>
> IIUC, if the user specifies the "correct LSN", then it's possible to
> skip exact txn for which the sync workers are unable to apply changes,
> right?
>
> How can the user get the LSN (which we call "correct LSN")? Is it from
> pg_replication_slots? Or some other way?
>
> If the user somehow can get the "correct LSN", can't the exact txn be
> skipped using it with any of the existing ways, either using
> pg_replication_origin_advance or any other ways?

One possible way I know of is to copy the logical replication slot used
by the subscriber and peek at the changes to identify the correct LSN
(maybe there is a handier way, though). For example, suppose that
two transactions insert tuples as follows on the publisher:

TX-A: BEGIN;
TX-A: INSERT INTO test VALUES (1);
TX-B: BEGIN;
TX-B: INSERT INTO test VALUES (10);
TX-B: COMMIT;
TX-A: INSERT INTO test VALUES (2);
TX-A: COMMIT;

And suppose further that the insertion with value = 10 (by TX-B)
cannot be applied on the subscriber due to a unique constraint
violation. If we copy the slot with
pg_copy_logical_replication_slot('test_sub', 'copy_slot', true,
'test_decoding'), we can peek at those changes with their LSNs as follows:

=# select * from pg_logical_slot_peek_changes('copy_slot', null, null) order by lsn;
    lsn    | xid |                   data
-----------+-----+-------------------------------------------
 0/1911548 | 736 | BEGIN 736
 0/1911548 | 736 | table public.test: INSERT: c[integer]:1
 0/1911588 | 737 | BEGIN 737
 0/1911588 | 737 | table public.test: INSERT: c[integer]:10
 0/19115F8 | 737 | COMMIT 737
 0/1911630 | 736 | table public.test: INSERT: c[integer]:2
 0/19116A0 | 736 | COMMIT 736
(7 rows)

In this case, '0/19115F8' is the correct LSN to specify. We can
advance the replication origin to '0/19115F8' with
pg_replication_origin_advance() so that logical replication streams only
transactions committed after '0/19115F8'. After logical replication
restarts, it skips the transaction with xid = 737 but still
replicates the transaction with xid = 736.
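
For completeness, the remaining steps are a minimal sketch along these
lines (assuming the subscription is disabled first so that the apply
worker releases the origin; 'pg_16389' stands in for whatever the origin
lookup returns):

ALTER SUBSCRIPTION test_sub DISABLE;
-- the origin used by a subscription is named pg_<subscription oid>
SELECT 'pg_' || oid AS origin FROM pg_subscription WHERE subname = 'test_sub';
-- skip the transaction committed at 0/19115F8 (xid = 737 here)
SELECT pg_replication_origin_advance('pg_16389', '0/19115F8');
ALTER SUBSCRIPTION test_sub ENABLE;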

> If there's no way to get the "correct LSN", then why can't we just
> print that LSN in the error context and/or in the new statistics view
> for logical replication workers, so that any of the existing ways can
> be used to skip exactly one txn?

I think specifying XID to the subscription is more understandable for users.

>
> IIUC, the feature proposed here guards against the users specifying
> wrong LSN. If I'm right, what is the guarantee that users don't
> specify the wrong txn id? Why can't we tell the users when a wrong LSN
> is specified that "currently, an apply worker is failing to apply the
> LSN , and you specified LSN , are you sure this is
> intentional?"

With the initial idea, specifying the correct XID is the user's
responsibility. If they specify an old XID, the worker invalidates it and
raises a warning saying "the worker invalidated the specified XID as
it's too old". As a second idea, if we store the last failed XID
somewhere (e.g., in a system catalog), the user can simply ask to skip
that transaction. That is, instead of specifying the XID they could do
something like "ALTER SUBSCRIPTION test_sub RESOLVE CONFLICT BY SKIP".

Regards,

-- 
Masahiko Sawada

Re: Transactions involving multiple postgres foreign servers, take 2

2021-05-25 Thread Masahiko Sawada
On Fri, May 21, 2021 at 5:48 PM Masahiro Ikeda  wrote:
>
>
>
> On 2021/05/21 13:45, Masahiko Sawada wrote:
> >
> > Yes. We also might need to be careful about the order of foreign
> > transaction resolution. I think we need to resolve foreign transactions
> > in arrival order, at least within a foreign server.
>
> I agree it's better.
>
> (Although this is my interest...)
> Is it necessary? Although this idea seems to be for atomic visibility,
> 2PC can't realize that as you know. So, I wondered that.

I think it's for fairness. If a foreign transaction that arrived earlier
keeps getting put off in favor of foreign transactions that arrived later,
merely because of its index in FdwXactCtl->xacts, that is hard for users
to understand and not fair. I think it's better to handle foreign
transactions in a FIFO manner (although this problem exists even in the
current code).

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: logical replication empty transactions

2021-05-25 Thread Ajin Cherian
On Tue, Apr 27, 2021 at 1:49 PM Ajin Cherian  wrote:

Rebased the patch as it was no longer applying.

regards,
Ajin Cherian
Fujitsu Australia


v6-0001-Skip-empty-transactions-for-logical-replication.patch
Description: Binary data


Re: Assertion failure while streaming toasted data

2021-05-25 Thread Dilip Kumar
On Tue, May 25, 2021 at 5:46 PM Dilip Kumar  wrote:
>
> On Tue, May 25, 2021 at 4:50 PM Amit Kapila  wrote:
> >
> > Your patch will fix the reported scenario but I don't like the way
> > multi_insert flag is used to detect incomplete tuple. One problem
> > could be that even when there are no toast inserts, it won't allow to
> > stream unless we get the last tuple of multi insert WAL. How about
> > changing the code such that when we are clearing the toast flag, we
> > additionally check 'clear_toast_afterwards' flag?
>
> Yes, that can be done, I will fix this in the next version of the patch.

I have fixed as per the suggestion, and as per the offlist discussion,
I have merged the TOAST and SPEC insert flag and created a single
PARTIAL_CHANGE flag.
I have also added a test case for this.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From 37bd2166bc5036613ec2666ac06ae8106170550b Mon Sep 17 00:00:00 2001
From: Dilip Kumar 
Date: Tue, 25 May 2021 14:51:45 +0530
Subject: [PATCH v2] Fix bug while streaming the multi-insert toast changes

While processing a multi-insert we cannot clean the toast until
we get the last insert of the multi-insert.  So mark the transaction
incomplete for streaming if we get any multi-insert change, and keep
it set until we get the last tuple of the multi-insert.
---
 contrib/test_decoding/expected/stream.out   | 41 +
 contrib/test_decoding/sql/stream.sql|  5 +++
 src/backend/replication/logical/reorderbuffer.c | 36 ++
 src/include/replication/reorderbuffer.h | 27 
 4 files changed, 68 insertions(+), 41 deletions(-)

diff --git a/contrib/test_decoding/expected/stream.out b/contrib/test_decoding/expected/stream.out
index e1c3bc8..1e93955 100644
--- a/contrib/test_decoding/expected/stream.out
+++ b/contrib/test_decoding/expected/stream.out
@@ -82,6 +82,47 @@ SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'incl
  committing streamed transaction
 (13 rows)
 
+-- streaming test for toast with multi-insert
+SELECT 'psql -At -c "copy stream_test to stdout" ' || current_database() AS copy_command \gset
+COPY stream_test FROM program :'copy_command';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+   data   
+--
+ opening a streamed block for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ closing a streamed block for transaction
+ committing streamed transaction
+(33 rows)
+
 DROP TABLE stream_test;
 SELECT pg_drop_replication_slot('regression_slot');
  pg_drop_replication_slot 
diff --git a/contrib/test_decoding/sql/stream.sql b/contrib/test_decoding/sql/stream.sql
index ce86c81..cad028b 100644
--- a/contrib/test_decoding/sql/stream.sql
+++ b/contrib/test_decoding/sql/stream.sql
@@ -26,5 +26,10 @@ SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'inc
 INSERT INTO stream_test SELECT repeat('a', 6000) || g.i FROM generate_series(1, 10) g(i);
 SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
 
+-- streaming test for toast with multi-insert
+SELECT 'psql -At -c "copy stream_test to stdout" ' || current_database() AS copy_command \gset
+COPY stream_test FROM program :'copy_command';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
 DROP TABLE stream_test;
 SELECT pg_drop_replication_slot('regression_slot');
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index b0ab91c..4401608 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -705,31 +705,27 @@ Reor

Re: Assertion failure while streaming toasted data

2021-05-25 Thread Pavan Deolasee
On Tue, May 25, 2021 at 6:43 PM Dilip Kumar  wrote:

>
> I have also added a test case for this.
>
>
Is that test good enough to trigger the original bug? In my experience, I
had to add a lot more tuples before the logical_decoding_work_mem
threshold was crossed and the streaming kicked in. I would suggest running
the test without the fix and check if the assertion hits. If so, we are
good to go.

Thanks,
Pavan

-- 
Pavan Deolasee
EnterpriseDB: https://www.enterprisedb.com


Re: Assertion failure while streaming toasted data

2021-05-25 Thread Dilip Kumar
On Tue, May 25, 2021 at 6:50 PM Pavan Deolasee  wrote:

> Is that test good enough to trigger the original bug? In my experience, I had 
> to add a lot more tuples before the logical_decoding_work_mem threshold was 
> crossed and the streaming kicked in. I would suggest running the test without 
> the fix and check if the assertion hits. If so, we are good to go.
>
Yes, it reproduces without the fix; I already tested it.  Basically, I
am using the "stream_test" table in the "copy stream_test to stdout"
command, which already has 20 toasted tuples, each of size 6000 bytes,
so that is big enough to cross 64kB.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Refactor "mutually exclusive options" error reporting code in parse_subscription_options

2021-05-25 Thread Alvaro Herrera
On 2021-May-25, Bharath Rupireddy wrote:

> On Mon, May 24, 2021 at 11:37 PM Alvaro Herrera  
> wrote:

> > There's no API limitation here, since that stuff is not user-visible, so
> > it doesn't matter.  If we ever need a 33rd option, we can change the
> > datatype to bits64.  Any extensions using it will have to be recompiled
> > across a major version jump anyway.
> 
> Thanks. I think there's no bits64 data type currently, I'm sure you
> meant we will define (when requirement arises) something like typedef
> uint64 bits64; Am I correct?

Right.

> I see that the commit a3dc926 and discussion at [1] say below respectively:
> "All the options of those commands are changed to use hex values
> rather than enums to reduce the risk of compatibility bugs when
> introducing new options."
> "My reasoning is that if you look at an enum value of this type,
> either say in a switch statement or a debugger, the enum value might
> not be any of the defined symbols. So that way you lose all the type
> checking that an enum might give you."
> 
> I'm not able to grasp what are the incompatibilities we can have if
> the enums are used as bit masks. It will be great if anyone throws
> some light on this?

The problem is that enum members have consecutive integers assigned by
the compiler.  Say you have an enum with three values for options.  They
get assigned 0, 1, and 2.  You can test for each option with "opt &
VAL_ONE" and "opt & VAL_TWO" and everything works -- each test returns
true when that specific option is set, and all is well.  Now if somebody
later adds a fourth option, it gets value 3.  When that option is set,
"opt & VAL_ONE" magically returns true, even though you did not set that
bit in your code.  So that becomes a bug.

Using hex values or bitshifting (rather than letting the compiler decide
its value in the enum) is a more robust way to ensure that the options
will not collide in that way.

So why not define the enum as a list, and give each option an exclusive
bit by bitshifting? For example,

enum options {
  OPT_ZERO  = 0,
  OPT_ONE   = 1 << 1,
  OPT_TWO   = 1 << 2,
  OPT_THREE = 1 << 3,
};

This should be okay, right?  Well, almost. The problem here is that if you
want to have a variable where you set more than one option, you have to
bit-OR the enum values together ... and the resulting value is no longer
part of the enum.  A compiler would be understandably upset if you try
to pass that value in a variable of the enum datatype.

-- 
Álvaro Herrera   Valdivia, Chile
"Ed is the standard text editor."
  http://groups.google.com/group/alt.religion.emacs/msg/8d94ddab6a9b0ad3




Re: pg_rewind fails if there is a read only file.

2021-05-25 Thread Laurenz Albe
On Tue, 2021-05-25 at 16:57 +0800, Paul Guo wrote:
> > You seem to have missed my point completely. The answer to this problem
> > IMNSHO is "Don't put a read-only file in the data directory".
> 
> Oh sorry. Well, if we really do not want this we may want to document this
> and keep educating users, but meanwhile probably the product should be
> more user friendly for the case, especially considering
> that we know the fix would be trivial and I suspect it is inevitable that some
> extensions put some read only files (e.g. credentials files) in pgdata.

Good idea.  I suggest this documentation page:
https://www.postgresql.org/docs/current/creating-cluster.html

Perhaps something along the line of:

  It is not supported to manually create, delete or modify files in the
  data directory, unless they are configuration files or the documentation
  explicitly says otherwise (for example, recovery.signal
  for archive recovery).

Yours,
Laurenz Albe





Re: Assertion failure while streaming toasted data

2021-05-25 Thread Pavan Deolasee
On Tue, May 25, 2021 at 6:55 PM Dilip Kumar  wrote:

> On Tue, May 25, 2021 at 6:50 PM Pavan Deolasee 
> wrote:
>
> > Is that test good enough to trigger the original bug? In my experience,
> I had to add a lot more tuples before the logical_decoding_work_mem
> threshold was crossed and the streaming kicked in. I would suggest running
> the test without the fix and check if the assertion hits. If so, we are
> good to go.
> >
> Yes, it is reproducing without fix, I already tested it.  Basically, I
> am using the "stream_test" table in "copy stream_test to stdout""
> command which already has 20 toasted tuples, each of size 6000 bytes
> so that is big enough to cross 64kB.
>
>
Ok, great! Thanks for confirming.

Thanks,
Pavan

-- 
Pavan Deolasee
EnterpriseDB: https://www.enterprisedb.com


Re: Skip partition tuple routing with constant partition key

2021-05-25 Thread Amit Langote
Hou-san,

On Mon, May 24, 2021 at 10:15 PM houzj.f...@fujitsu.com
 wrote:
> From: Amit Langote 
> Sent: Monday, May 24, 2021 4:27 PM
> > On Mon, May 24, 2021 at 10:31 AM houzj.f...@fujitsu.com
> >  wrote:
> > > Currently for a normal RANGE partition key it will first generate a
> > > CHECK expression like: [Keyexpression IS NOT NULL AND Keyexpression >=
> > > lowbound AND Keyexpression < upperbound].
> > > In this case, Keyexpression will be re-executed which will bring some
> > overhead.
> > >
> > > Instead, I think we can try to do the following step:
> > > 1)extract the Keyexpression from the CHECK expression 2)evaluate the
> > > key expression in advance 3)pass the result of key expression to do
> > > the partition CHECK.
> > > In this way ,we only execute the key expression once which looks more
> > efficient.
> >
> > I would have preferred this not to touch anything but ExecPartitionCheck(), 
> > at
> > least in the first version.  Especially, seeing that your patch touches
> > partbounds.c makes me a bit nervous, because the logic there is pretty
> > complicated to begin with.
>
> Agreed.
>
> > How about we start with something like the attached?  It's the same idea
> > AFAICS, but implemented with a smaller footprint.  We can consider teaching
> > relcache about this as the next step, if at all.  I haven't measured the
> > performance, but maybe it's not as fast as yours, so will need some 
> > fine-tuning.
> > Can you please give it a read?
>
> Thanks for the patch and It looks more compact than mine.
>
> After taking a quick look at the patch, I found a possible issue.
> Currently, the patch does not search the parent's partition key expression 
> recursively.
> For example, If we have multi-level partition:
> Table A is partition of Table B, Table B is partition of Table C.
> It looks like if we insert into Table A, then we do not replace the key
> expression which comes from Table C.

Good catch!  Although, I was relieved to realize that it's not *wrong*
per se, as in it does not produce an incorrect result, but only
*slower* than if the patch was careful enough to replace all the
parents' key expressions.

> If we want to get Table C, we might need to use pg_inherits, but that
> costs too much to me.
> Instead, maybe we can use the existing logic which already scans
> pg_inherits in function
> generate_partition_qual(). Although this change is out of 
> ExecPartitionCheck(). I think we'd better
> replace all the parents and grandparent...'s key expression.  Attaching a 
> demo patch based on the
> patch you posted earlier. I hope it will help.

Thanks.

Though again, I think we can do this without changing the relcache
interface, such as RelationGetPartitionQual().

PartitionTupleRouting has all the information that's needed here.
Each partitioned table involved in routing a tuple to the leaf
partition has a PartitionDispatch struct assigned to it.  That struct
contains the PartitionKey and we can access partexprs from there.  We
can arrange to assemble them into a single list that is saved to a
given partition's ResultRelInfo, that is, after converting the
expressions to have partition attribute numbers.  I tried that in the
attached updated patch; see the 0002-* patch.

Regarding the first patch to make ExecFindPartition() cache last used
partition, I noticed that it only worked for the bottom-most parent in
a multi-level partition tree, because only leaf partitions were
assigned to dispatch->lastPartitionInfo.  I have fixed the earlier
patch to also save non-leaf partitions and their corresponding
PartitionDispatch structs so that parents of all levels can use this
caching feature.  The patch has to become somewhat complex as a
result, but hopefully not too unreadable.
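
To make the multi-level case concrete, here is a minimal sketch (table
and column names are made up) where the same key expression appears at
every level:

CREATE TABLE tbl_c (k int, v text) PARTITION BY RANGE ((k % 100));
CREATE TABLE tbl_b PARTITION OF tbl_c
  FOR VALUES FROM (0) TO (50) PARTITION BY RANGE ((k % 100));
CREATE TABLE tbl_a PARTITION OF tbl_b FOR VALUES FROM (0) TO (10);

-- tbl_a's implicit partition constraint contains (k % 100) twice: once for
-- its own bound within tbl_b and once for tbl_b's bound within tbl_c, so
-- the expression gets evaluated repeatedly unless it is pre-computed.
INSERT INTO tbl_a VALUES (5, 'x');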

--
Amit Langote
EDB: http://www.enterprisedb.com


0001-ExecFindPartition-cache-last-used-partition-v3.patch
Description: Binary data


0002-ExecPartitionCheck-pre-compute-partition-key-express.patch
Description: Binary data


Re: How can the Aggregation move to the outer query

2021-05-25 Thread Andy Fan
On Tue, May 25, 2021 at 7:42 PM David Rowley  wrote:

> On Tue, 25 May 2021 at 22:28, Andy Fan  wrote:
> >
> > explain (costs off) select  (select count(*) filter (where t2.b = 1)
> from m1 t1)
> > from m1 t2 where t2.b % 2 = 1;
> >
> >   QUERY PLAN
> > ---
> >  Aggregate
> >->  Seq Scan on m1 t2
> >  Filter: ((b % 2) = 1)
> >SubPlan 1
> >  ->  Seq Scan on m1 t1
> > (5 rows)
> >
> > This one is too confusing to me since the Aggregate happens
> > on t2 rather than t1.  What happens here? Would this query
> > generate 1 row all the time like SELECT aggfunc(a) FROM t?
>
> I think you're misreading the plan. There's a scan on t2 with a
> subplan then an aggregate on top of that. Because you made the
> subquery correlated by adding t2.b, it cannot be executed as an
> initplan.
>
> You might see what's going on better if you add VERBOSE to the EXPLAIN
> options.
>
>
Thanks,  VERBOSE does provide more information.

 Aggregate
   Output: (SubPlan 1)
   ->  Seq Scan on public.m1 t2
 Output: t2.a, t2.b
 Filter: ((t2.b % 2) = 1)
   SubPlan 1
 ->  Seq Scan on public.m1 t1
   Output: count(*) FILTER (WHERE (t2.b = 1))
(8 rows)

I am still confused about SubPlan 1: how can it output a
count(*) without an Aggregate node under it?  (If this is not easy to
explain, I can dig into it more by myself later.)

After all, I found this case while working on the UniqueKey stuff.
I have a rule that if (query->hasAgg && !query->groupClause), then
the query produces only 1 row.  In the above case, the outer query
(t2) has hasAgg=true and the subplan has hasAgg=false, which doesn't
look right to me.  I would have expected hasAgg=true on the subquery
and hasAgg=false on the outer query.  Is there anything I missed?

-- 
Best Regards
Andy Fan (https://www.aliyun.com/)


Re: pg_rewind fails if there is a read only file.

2021-05-25 Thread Andrew Dunstan


On 5/25/21 9:38 AM, Laurenz Albe wrote:
> On Tue, 2021-05-25 at 16:57 +0800, Paul Guo wrote:
>>> You seem to have missed my point completely. The answer to this problem
>>> IMNSHO is "Don't put a read-only file in the data directory".
>> Oh sorry. Well, if we really do not want this we may want to document this
>> and keep educating users, but meanwhile probably the product should be
>> more user friendly for the case, especially considering
>> that we know the fix would be trivial and I suspect it is inevitable that 
>> some
>> extensions put some read only files (e.g. credentials files) in pgdata.
> Good idea.  I suggest this documentation page:
> https://www.postgresql.org/docs/current/creating-cluster.html
>
> Perhaps something along the line of:
>
>   It is not supported to manually create, delete or modify files in the
>   data directory, unless they are configuration files or the documentation
>   explicitly says otherwise (for example, recovery.signal
>   for archive recovery).
>

Perhaps we should add that read-only files can be particularly problematic.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: How can the Aggregation move to the outer query

2021-05-25 Thread Tom Lane
David Rowley  writes:
> On Tue, 25 May 2021 at 22:28, Andy Fan  wrote:
>> explain (costs off) select  (select count(*) filter (where t2.b = 1) from m1 
>> t1)
>> from m1 t2 where t2.b % 2 = 1;
>> 
>> This one is too confusing to me since the Aggregate happens
>> on t2 rather than t1.  What happens here? Would this query
>> generate 1 row all the time like SELECT aggfunc(a) FROM t?

> I think you're misreading the plan. There's a scan on t2 with a
> subplan then an aggregate on top of that. Because you made the
> subquery correlated by adding t2.b, it cannot be executed as an
> initplan.

Also keep in mind that adding that filter clause completely changed
the meaning of the aggregate.  Aggregates belong to the lowest
query level containing any Var used in their arguments, so that
where in your original query the count(*) was an aggregate of the
subquery, now it's an aggregate of the outer query (and the subquery
now perceives it as a constant outer reference).  AFAIR this is per
SQL spec.
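
For instance, a variant of the same query where the aggregate's FILTER
references only t1 keeps the aggregation at the subquery level (just a
sketch against the same table):

-- count(*) now references only t1.b, so it aggregates over the subquery;
-- being uncorrelated, the subplan can be executed once as an initplan
explain (costs off)
select (select count(*) filter (where t1.b = 1) from m1 t1)
from m1 t2 where t2.b % 2 = 1;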

regards, tom lane




Re: pg_rewind fails if there is a read only file.

2021-05-25 Thread Tom Lane
Andrew Dunstan  writes:
> Perhaps we should add that read-only files can be particularly problematic.

Given the (legitimate, IMO) example of a read-only SSL key, I'm not
quite convinced that pg_rewind doesn't need to cope with this.

regards, tom lane




Re: Replace run-time error check with assertion

2021-05-25 Thread Tom Lane
Peter Eisentraut  writes:
> In the attached patch, the error message was checking that the 
> structures returned from the parser matched expectations.  That's 
> something we usually use assertions for, not a full user-facing error 
> message.  So I replaced that with an assertion (hidden inside 
> lfirst_node()).

Works for me.  It's certainly silly to use a translatable ereport
rather than elog for this.

Localizing those variables some more looks sane too.

regards, tom lane




Re: CALL versus procedures with output-only arguments

2021-05-25 Thread Tom Lane
Peter Eisentraut  writes:
> On 24.05.21 02:01, Tom Lane wrote:
>>> I think we ought to fix this so that OUT-only arguments are ignored
>>> when calling from SQL not plpgsql.

> I don't understand why you want to change this.  The argument resolution 
> of CALL is specified in the SQL standard; we shouldn't just make up our 
> own system.

I don't really see how you can argue that the existing behavior is
more spec-compliant than what I'm suggesting.  What I read in the spec
(SQL:2021 10.4  SR 9) h) iii) 1)) is

1) If Pi is an output SQL parameter, then XAi shall be a .

(where  more or less reduces to "variable").
Now, sure, that's what we've got in plpgsql, and I'm not proposing
to change that.  But in plain SQL, as of HEAD, you are supposed to
write NULL, or a random literal, or indeed anything at all *except*
a variable.  How is that more standard-compliant than not writing
anything?

Also, one could argue that the behavior I'm suggesting is completely
spec-compliant if one assumes that the OUT parameters have some sort
of default, allowing them to be omitted from the call.
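
A minimal sketch of the two calling styles under discussion (the
procedure body is only for illustration):

CREATE PROCEDURE p(IN a int, OUT b int) LANGUAGE plpgsql
AS $$ BEGIN b := a + 1; END $$;

CALL p(1, NULL);  -- HEAD: the OUT argument must be written (NULL or any value)
CALL p(1);        -- proposed: OUT-only arguments can be omitted in plain SQL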

More generally, there are enough deviations from spec in what we do
to perform ambiguous-call resolution that it seems rather silly to
hang your hat on this particular point.

Now as against that, we are giving up a whole lot of consistency.
As of HEAD:

* The rules for what is a conflict of signatures are different
for functions and procedures.

* The rules for how to identify a target routine in ALTER, DROP,
etc are different for functions and procedures.  That's especially
nasty in ALTER/DROP ROUTINE, where we don't have a syntax cue
as to whether or not to ignore OUT parameters.

* The rules for how to call functions and procedures with OUT
parameters from SQL are different.

* Client code that looks at pg_proc.proargtypes is almost certainly
going to be broken.

I don't like any of those side-effects, and I don't want to pay
those prices for what seems to me to be a bogus claim of improved
spec compliance.

regards, tom lane




Re: Test of a partition with an incomplete detach has a timing issue

2021-05-25 Thread Alvaro Herrera
So I had a hard time reproducing the problem, until I realized that I
needed to limit the server to use only one CPU, and in addition run some
other stuff concurrently in the same server in order to keep it busy.
With that, I see about one failure every 10 runs.

So I start the server as "numactl -C0 postmaster", then another terminal
with an infinite loop doing "make -C src/test/regress installcheck-parallel";
and a third terminal doing this

while [ $? == 0 ]; do ../../../src/test/isolation/pg_isolation_regress 
--inputdir=/pgsql/source/master/src/test/isolation --outputdir=output_iso 
--bindir='/pgsql/install/master/bin' detach-partition-concurrently-3 
detach-partition-concurrently-3 detach-partition-concurrently-3 
detach-partition-concurrently-3 detach-partition-concurrently-3 
detach-partition-concurrently-3 detach-partition-concurrently-3 
detach-partition-concurrently-3 detach-partition-concurrently-3 
detach-partition-concurrently-3 detach-partition-concurrently-3 
detach-partition-concurrently-3 detach-partition-concurrently-3 ; done

With the test unpatched, I get about one failure in the set.

On 2021-May-24, Noah Misch wrote:

> What if we had a standard that the step after the cancel shall send a query to
> the backend that just received the cancel?  Something like:

Hmm ... I don't understand why this fixes the problem, but it
drastically reduces the probability.  Here's a complete patch.  I got
about one failure in 1000 instead of 1 in 10.  The new failure looks
like this:

diff -U3 
/pgsql/source/master/src/test/isolation/expected/detach-partition-concurrently-3.out
 
/home/alvherre/Code/pgsql-build/master/src/test/isolation/output_iso/results/detach-partition-concurrently-3.out
--- 
/pgsql/source/master/src/test/isolation/expected/detach-partition-concurrently-3.out
2021-05-25 11:12:42.333987835 -0400
+++ 
/home/alvherre/Code/pgsql-build/master/src/test/isolation/output_iso/results/detach-partition-concurrently-3.out
2021-05-25 11:19:03.714947775 -0400
@@ -13,7 +13,7 @@
 
 t  
 step s2detach: <... completed>
-error in steps s1cancel s2detach: ERROR:  canceling statement due to user 
request
+ERROR:  canceling statement due to user request
 step s2noop: UNLISTEN noop;
 step s1c: COMMIT;
 step s1describe: SELECT 'd3_listp' AS root, * FROM 
pg_partition_tree('d3_listp')


I find this a bit weird and I'm wondering if it could be an
isolationtester bug -- why is it not attributing the error message to
any steps?

The problem disappears completely if I add a sleep to the cancel query:

step "s1cancel" { SELECT pg_cancel_backend(pid), pg_sleep(0.01) FROM 
d3_pid; }

I suppose a 0.01 second sleep is not going to be sufficient to close the
problem in slower animals, but I hesitate to propose a much longer sleep
because this test has 18 permutations so even a one second sleep adds
quite a lot of (mostly useless) test runtime.

-- 
Álvaro Herrera39°49'30"S 73°17'W




Re: Test of a partition with an incomplete detach has a timing issue

2021-05-25 Thread Tom Lane
Alvaro Herrera  writes:
> The problem disappears completely if I add a sleep to the cancel query:
> step "s1cancel"   { SELECT pg_cancel_backend(pid), pg_sleep(0.01) FROM 
> d3_pid; }
> I suppose a 0.01 second sleep is not going to be sufficient to close the
> problem in slower animals, but I hesitate to propose a much longer sleep
> because this test has 18 permutations so even a one second sleep adds
> quite a lot of (mostly useless) test runtime.

Yeah ... maybe 0.1 second is the right tradeoff?

Note that on slow (like CCA) animals, the extra query required by
Noah's suggestion is likely to take more than 0.1 second.

regards, tom lane




Re: Race condition in recovery?

2021-05-25 Thread Robert Haas
On Sun, May 23, 2021 at 12:08 PM Dilip Kumar  wrote:
> I have created a tap test based on Robert's test.sh script.  It
> reproduces the issue.  I am new with perl so this still needs some
> cleanup/improvement, but at least it shows the idea.

Thanks. I think this is the right idea but just needs a few adjustments.

I don't think that dynamically writing out a file into the current
working directory of the script is the right approach. Instead I think
we should be planning to check this file into the repository and then
have the test script find it. Now the trick is how to do that in a
portable way. I think we can probably use the same idea that the
pg_rewind tests use to find a perl module located in the test
directory.  That is:

use FindBin;

and then use $FindBin::RealBin to construct a path name to the executable, e.g.

$node_primary->append_conf(
   'postgresql.conf', qq(
archive_command = '"$FindBin::RealBin/skip_cp" "%p" "$archivedir_primary/%f"'
));

This avoids issues such as: leaving behind files if the script is
terminated, needing the current working directory to be writable,
possible permissions issues with the new file under Windows or
SE-Linux.

The restore_command needs to be "cp" on Linux but "copy" on Windows.
Maybe you can use PostgresNode.pm's enable_restoring? Or if that
doesn't work, then you need to mimic the logic, as
src/test/recovery/t/020_archive_status.pl does for archive_command.

Why do you set log_line_prefix? Is that needed?

Why are the nodes called standby_1 and cascade? Either use standby and
cascade or standby_1 and standby_2.

There is a comment that says "Create some content on primary and check
its presence in standby 1" but it only creates the content, and does
not check anything. I think we don't really need to do any of this,
but at least the code and the comment have to match.

Let's not call the command skip_cp. It's not very descriptive. If you
don't like recalcitrant_cp, then maybe something like cp_history_files
or so.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: automatic analyze: readahead - add "IO read time" log message

2021-05-25 Thread Egor Rogov

Hi,

On 11.02.2021 01:10, Stephen Frost wrote:


Greetings,

* Heikki Linnakangas (hlinn...@iki.fi) wrote:

On 05/02/2021 23:22, Stephen Frost wrote:

Unless there's anything else on this, I'll commit these sometime next
week.

One more thing: Instead of using 'effective_io_concurrency' GUC directly,
should call get_tablespace_maintenance_io_concurrency().

Ah, yeah, of course.

Updated patch attached.



I happened to notice that get_tablespace_io_concurrency() is called 
instead of get_tablespace_maintenance_io_concurrency(). It doesn't look 
right, no?



Regards,
Egor Rogov.





Re: Possible pointer var TupleDesc rettupdesc used not initialized (src/backend/optimizer/util/clauses.c)

2021-05-25 Thread Tom Lane
Ranier Vilela  writes:
> Possible pointer TupleDesc rettupdesc used not initialized?
> if (!isNull) at line 4346 taking a true branch, the function
> check_sql_fn_retval at line 4448 can use rettupdesc uninitialized.

This seems to have been introduced by the SQL-function-body patch.

After some study, I concluded that the reason we haven't noticed
is that the case is nearly unreachable: check_sql_fn_retval never
consults the rettupdesc unless the function result type is composite
and the tlist length is more than one --- and we eliminated the latter
case earlier in inline_function.

There is an exception, namely if the single tlist item fails to
be coercible to the output type, but that's hard to get to given
that it'd have been checked while defining the SQL-body function.
I did manage to reproduce a problem after turning off
check_function_bodies so I could create a broken function.

In any case, inline_function has no business assuming that
check_sql_fn_retval doesn't need a valid value.

The simplest way to fix this seems to be to move the code that
creates "fexpr" and calls get_expr_result_type, so that we always
do that even for SQL-body cases.  We could alternatively use some
other way to obtain a result tupdesc in the SQL-body path; but
creating the dummy FuncExpr node is cheap enough that I don't
think it's worth contortions to avoid doing it.

regards, tom lane




Re: array_cat anycompatible change is breaking xversion upgrade tests

2021-05-25 Thread Justin Pryzby
On Thu, May 20, 2021 at 07:35:10PM -0400, Tom Lane wrote:
> Justin Pryzby  writes:
> > On Wed, Nov 04, 2020 at 07:43:51PM -0500, Tom Lane wrote:
> >> As was discussed in the thread leading up to that commit, modifying the
> >> signature of array_cat and friends could break user-defined operators
> >> and aggregates based on those functions.  It seems to me that the
> >> usability gain from this change is worth that cost, but it is causing
> >> an issue for xversion tests.
> 
> > But I think this should be called out as an incompatible change in the 
> > release
> > notes.
> 
> If it was not, yes it should be.

@Bruce, I propose:

Some system functions are changed to accept "anycompatiblearray" arguments.
This causes failures when restoring a database backup or running pg_restore if
there were aggregate functions defined using those functions with their
original argument types.

Such aggregate functions should be dropped before upgrade/restore and then
re-created afterwards using the "anycompatible" functions.  The affected
functions are: array_append, array_prepend, array_cat, array_position,
array_positions, array_remove, array_replace, and width_bucket.

(Re-defining the function before upgrading is possible only when
upgrading from v13.)
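
As a concrete illustration, a sketch using the documentation's classic
array_accum example (names are illustrative only):

-- an aggregate as it might exist on v13 or earlier:
CREATE AGGREGATE array_accum (anyelement) (
    sfunc = array_append,
    stype = anyarray,
    initcond = '{}'
);
-- restoring that definition into v14 fails because array_append no longer
-- has the (anyarray, anyelement) signature; drop it before upgrade/restore
-- and re-create it afterwards against the new signature, for example:
CREATE AGGREGATE array_accum (anycompatible) (
    sfunc = array_append,
    stype = anycompatiblearray,
    initcond = '{}'
);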

-- 
Justin




Re: How can the Aggregation move to the outer query

2021-05-25 Thread Andy Fan
On Tue, May 25, 2021 at 10:23 PM Tom Lane  wrote:

> David Rowley  writes:
> > On Tue, 25 May 2021 at 22:28, Andy Fan  wrote:
> >> explain (costs off) select  (select count(*) filter (where t2.b = 1)
> from m1 t1)
> >> from m1 t2 where t2.b % 2 = 1;
> >>
> >> This one is too confusing to me since the Aggregate happens
> >> on t2 rather than t1.  What happens here? Would this query
> >> generate 1 row all the time like SELECT aggfunc(a) FROM t?
>
> > I think you're misreading the plan. There's a scan on t2 with a
> > subplan then an aggregate on top of that. Because you made the
> > subquery correlated by adding t2.b, it cannot be executed as an
> > initplan.
>
> Also keep in mind that adding that filter clause completely changed
> the meaning of the aggregate.  Aggregates belong to the lowest
> query level containing any Var used in their arguments, so that
> where in your original query the count(*) was an aggregate of the
> subquery, now it's an aggregate of the outer query (and the subquery
> now perceives it as a constant outer reference).  AFAIR this is per
> SQL spec.
>

Well, finally I know it's an aggregate of the outer query.  Thank you for
the explanation!  So I would say the result set always has exactly 1 row
for that query.

-- 
Best Regards
Andy Fan (https://www.aliyun.com/)


Re: Test of a partition with an incomplete detach has a timing issue

2021-05-25 Thread Alvaro Herrera
On 2021-May-25, Tom Lane wrote:

> Alvaro Herrera  writes:
> > The problem disappears completely if I add a sleep to the cancel query:
> > step "s1cancel" { SELECT pg_cancel_backend(pid), pg_sleep(0.01) FROM 
> > d3_pid; }
> > I suppose a 0.01 second sleep is not going to be sufficient to close the
> > problem in slower animals, but I hesitate to propose a much longer sleep
> > because this test has 18 permutations so even a one second sleep adds
> > quite a lot of (mostly useless) test runtime.
> 
> Yeah ... maybe 0.1 second is the right tradeoff?

Pushed with a 0.1 sleep, and some commentary.

> Note that on slow (like CCA) animals, the extra query required by
> Noah's suggestion is likely to take more than 0.1 second.

Hmm, but the sleep is to compete with the cancelling of detach, not with
the noop query.

I tried running the test under CCA here and it didn't fail, but of
course that's not a guarantee of anything since it only completed one
iteration.

-- 
Álvaro Herrera   Valdivia, Chile




Re: Possible pointer var TupleDesc rettupdesc used not initialized (src/backend/optimizer/util/clauses.c)

2021-05-25 Thread Ranier Vilela
Em ter., 25 de mai. de 2021 às 13:09, Tom Lane  escreveu:

> Ranier Vilela  writes:
> > Possible pointer TupleDesc rettupdesc used not initialized?
> > if (!isNull) at line 4346 taking a true branch, the function
> > check_sql_fn_retval at line 4448 can use rettupdesc uninitialized.
>
> This seems to have been introduced by the SQL-function-body patch.
>
> After some study, I concluded that the reason we haven't noticed
> is that the case is nearly unreachable: check_sql_fn_retval never
> consults the rettupdesc unless the function result type is composite
> and the tlist length is more than one --- and we eliminated the latter
> case earlier in inline_function.
>
> There is an exception, namely if the single tlist item fails to
> be coercible to the output type, but that's hard to get to given
> that it'd have been checked while defining the SQL-body function.
> I did manage to reproduce a problem after turning off
> check_function_bodies so I could create a broken function.
>
> In any case, inline_function has no business assuming that
> check_sql_fn_retval doesn't need a valid value.
>
> The simplest way to fix this seems to be to move the code that
> creates "fexpr" and calls get_expr_result_type, so that we always
> do that even for SQL-body cases.  We could alternatively use some
> other way to obtain a result tupdesc in the SQL-body path; but
> creating the dummy FuncExpr node is cheap enough that I don't
> think it's worth contortions to avoid doing it.
>
Following the guidelines, I have provided a patch.
But I did a bit more than requested: I removed a redundant variable and
reduced the scope of two others.

vcregress check passes fine.

regards,
Ranier Vilela


v1_fix_uninitialized_var_clauses.patch
Description: Binary data


Re: Possible pointer var TupleDesc rettupdesc used not initialized (src/backend/optimizer/util/clauses.c)

2021-05-25 Thread Tom Lane
Ranier Vilela  writes:
> Following the guidelines, I provided a patch.

Oh, I already pushed a fix, thanks.

regards, tom lane




Re: storing an explicit nonce

2021-05-25 Thread Andres Freund
Hi,

On 2021-05-25 12:46:45 -0400, Robert Haas wrote:
> This approach has a few disadvantages. For example, right now, we only
> need to WAL log hints for the first write to each page after a
> checkpoint, but in this approach, if the same page is written multiple
> times per checkpoint cycle, we'd need to log hints every time. In some
> workloads that could be quite expensive, especially if we log an FPI
> every time.

Yes. I think it'd likely be prohibitively expensive in some situations.


> So I would like to propose an alternative: store the nonce in the
> page. Now the next question is where to put it. I think that putting
> it into the page header would be far too invasive, so I propose that
> we instead store it at the end of the page, as part of the special
> space. That makes an awful lot of code not really notice that anything
> is different, because it always thought that the usable space on the
> page ended where the special space begins, and it doesn't really care
> where that is exactly. The code that knows about the special space
> might care a little bit, but whatever private data it's storing is
> going to be at the beginning of the special space, and the nonce would
> be stored - in this proposal - at the end of the special space. So it
> turns out that it doesn't really care that much either.

The obvious concerns are issues around binary upgrades for cases that
already use the special space? Are you planning to address that by not
having that path? Or by storing the nonce at the "start" of the special
space (i.e. [normal data][nonce][existing special])?

Is there an argument for generalizing the nonce approach to replace
fake LSNs for unlogged relations?

Why is using pd_special better than finding space for a flag bit in the
header indicating whether it has a nonce? Using pd_special will burden
all code using special space, and maybe even some that does not (think
empty pages now occasionally having a non-zero pd_special), whereas
implementing it on the page level wouldn't quite have the same concerns.


> One thing that happens is that a bunch of values that used to be
> constant - like TOAST_INDEX_TARGET and GinDataPageMaxDataSize - become
> non-constant. I suggested to Bharath that he handle this by changing
> those macros to take the nonce size as an argument, which is what the
> patch does, although it missed pushing that idea down all the way in
> some obscure case (e.g. SIGLEN_MAX). That has the down side that we
> will now have more computation to do at runtime vs. compile-time. I am
> unclear whether there would be enough impact to get exercised about,
> but I'm hopeful that the answer is "no".
> 
> As written, the patch makes initdb take a --tde-nonce-size argument,
> but that's really just for demonstration purposes. I assume that, if
> we decide to go this way, we'd have an initdb option that selects
> whether to use encryption, or perhaps the specific encryption
> algorithm to be used, and then the nonce size would be computed based
> on that, or else set to 0 if encryption is not in use.

I do suspect having only the "no nonce" or "nonce is a compile time
constant" cases would be good performance wise. Stuff like

> +#define MaxHeapTupleSizeLimit  (BLCKSZ - MAXALIGN(SizeOfPageHeaderData + \
> +sizeof(ItemIdData)))
> +#define MaxHeapTupleSize(tdeNonceSize)  (BLCKSZ - 
> MAXALIGN(SizeOfPageHeaderData + \
> +sizeof(ItemIdData)) - 
> MAXALIGN(tdeNonceSize))

won't be free.

Greetings,

Andres Freund




Re: Possible pointer var TupleDesc rettupdesc used not initialized (src/backend/optimizer/util/clauses.c)

2021-05-25 Thread Ranier Vilela
Em ter., 25 de mai. de 2021 às 14:35, Tom Lane  escreveu:

> Ranier Vilela  writes:
> > Following the guidelines, I provided a patch.
>
> Oh, I already pushed a fix, thanks.
>
No problem!

Thank you.

regards,
Ranier Vilela


Access root->simple_rte_array instead of Query->rtable for 2 more cases.

2021-05-25 Thread Andy Fan
While trying to understand the relationship between Query->rtable and
root->simple_rte_array, I'd like to assume that Query->rtable should
never be used once root->simple_rte_array is ready. I mainly checked
two places, make_one_rel and create_plan, with the hack below.

{
  List *l = root->parse->rtable;
  root->parse->rtable = NIL;
  make_one_rel..  or create_plan_recurse..
  root->parse->rtable = l;
}


Then I found that adjust_appendrel_attrs_mutator and infer_arbiter_indexes
still use it. The attached patch fixes that by replacing rt_fetch with
planner_rt_fetch; all the tests passed.


-- 
Best Regards
Andy Fan (https://www.aliyun.com/)


v1-0001-Use-planner_rt_fetch-instead-of-rt_fetch-when-roo.patch
Description: Binary data


Re: CALL versus procedures with output-only arguments

2021-05-25 Thread Peter Eisentraut

On 25.05.21 17:20, Tom Lane wrote:

I don't really see how you can argue that the existing behavior is
more spec-compliant than what I'm suggesting.  What I read in the spec
(SQL:2021 10.4  SR 9) h) iii) 1)) is

 1) If Pi is an output SQL parameter, then XAi shall be a .

(where  more or less reduces to "variable").
Now, sure, that's what we've got in plpgsql, and I'm not proposing
to change that.  But in plain SQL, as of HEAD, you are supposed to
write NULL, or a random literal, or indeed anything at all *except*
a variable.  How is that more standard-compliant than not writing
anything?


I concede that the current implementation is not fully standards 
compliant in this respect.  Maybe we need to rethink how we can satisfy 
this better.  For example, in some other implementations, you write CALL 
p(?), (where ? is the parameter placeholder), so it's sort of an output 
parameter.  However, changing it so that the entire way the parameters 
are counted is different seems a much greater departure.



More generally, there are enough deviations from spec in what we do
to perform ambiguous-call resolution that it seems rather silly to
hang your hat on this particular point.


I don't know what you mean by this.  Some stuff is different in the 
details, but you *can* write conforming code if you avoid the extremely 
complicated cases.  With your proposal, everything is always different, 
and we might as well remove the CALL statement and name it something 
else because users migrating from other systems won't be able to use it 
properly.



Now as against that, we are giving up a whole lot of consistency.
As of HEAD:

* The rules for what is a conflict of signatures are different
for functions and procedures.


But that's the fault of the way it was done for functions.  That doesn't 
mean we have to repeat it for procedures.  I mean, sure it would be 
better if it were consistent.  But SQL-standard syntax should behave in 
SQL standard ways.  Creating, altering, and dropping procedures is meant 
to be portable between SQL implementations.  If we change this in subtle 
ways so that DROP PROCEDURE p(int, int) drops a different procedure in 
different SQL implementations, that seems super-dangerous and annoying.





Re: CALL versus procedures with output-only arguments

2021-05-25 Thread Tom Lane
I wrote:
> * The rules for how to identify a target routine in ALTER, DROP,
> etc are different for functions and procedures.  That's especially
> nasty in ALTER/DROP ROUTINE, where we don't have a syntax cue
> as to whether or not to ignore OUT parameters.

Just to enlarge on that point a bit:

regression=# create function foo(int, out int) language sql
regression-# as 'select $1';
CREATE FUNCTION
regression=# create procedure foo(int, out int) language sql
regression-# as 'select $1';
CREATE PROCEDURE

IMO this should have failed, but since it doesn't:

regression=# drop routine foo(int, out int);
DROP ROUTINE

Which object was dropped, and what is the argument for that one
being the right one?

Experimentation shows that in HEAD, what is dropped is the procedure,
and indeed the DROP will fail if you try to use it on the function.
That is a compatibility break, because in previous versions this
worked:

regression=# create function foo(int, out int) language sql
as 'select $1';
CREATE FUNCTION
regression=# drop routine foo(int, out int);
DROP ROUTINE

The fact that you now have to be aware of these details to use
ALTER/DROP ROUTINE seems like a pretty serious loss of user
friendliness, as well as compatibility.

regards, tom lane




Re: storing an explicit nonce

2021-05-25 Thread Robert Haas
On Tue, May 25, 2021 at 1:37 PM Andres Freund  wrote:
> The obvious concerns are issues around binary upgrades for cases that
> already use the special space? Are you planning to address that by not
> having that path? Or by storing the nonce at the "start" of the special
> space (i.e. [normal data][nonce][existing special])?

Well, there aren't any existing encrypted clusters, so what is the
scenario exactly? Perhaps you are thinking that we'd have a pg_upgrade
option that would take an unencrypted cluster and encrypt all the
pages, without any other page format changes. If so, this design would
preclude that choice, because there might be no free space available.

> Is there an argument for generalizing the nonce approach to replace
> fake LSNs for unlogged relations?

I hadn't thought about that. Maybe. But that would require including
the nonce always, rather than only when TDE is selected, or including
it always in some kinds of pages and only conditionally in others,
which seems more complex.

> Why is using pd_special better than finding space for a flag bit in the
> header indicating whether it has a nonce? Using pd_special will burden
> all code using special space, and maybe even some that does not (think
> empty pages now occasionally having a non-zero pd_special), whereas
> implementing it on the page level wouldn't quite have the same concerns.

Well, I think there's a lot of code that knows where the line pointer
array starts, and all those calculations will have to become more
complex at runtime if we put the nonce anywhere near the start of the
page. I think there are way fewer things that care about the end of
the page. I dislike the idea that every call to PageGetItem() would
need to know the nonce size - there are hundreds of those calls and
making them more expensive seems a lot worse than the stuff this patch
changes.

It's always possible that I'm confused here, either about what you are
proposing or how impactful it would actually be...

> I do suspect having only the "no nonce" or "nonce is a compile time
> constant" cases would be good performance wise. Stuff like
>
> > +#define MaxHeapTupleSizeLimit  (BLCKSZ - MAXALIGN(SizeOfPageHeaderData + \
> > +                                         sizeof(ItemIdData)))
> > +#define MaxHeapTupleSize(tdeNonceSize)  (BLCKSZ - MAXALIGN(SizeOfPageHeaderData + \
> > +                                         sizeof(ItemIdData)) - MAXALIGN(tdeNonceSize))
>
> won't be free.

One question here is whether we're comfortable saying that the nonce
is entirely constant. I wasn't sure about that. It seems possible to
me that different encryption algorithms might want nonces of different
sizes, either now or in the future. I am not a cryptographer, but that
seemed like a bit of a limiting assumption. So Bharath and I decided
to make the POC cater to a fully variable-size nonce rather than
zero-or-some-constant. However, if the consensus is that
zero-or-some-constant is better, fair enough! The patch can certainly
be adjusted to cater to work that way.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Access root->simple_rte_array instead of Query->rtable for 2 more cases.

2021-05-25 Thread Tom Lane
Andy Fan  writes:
> When I am understanding the relationship between Query->rtable and
> root->simple_rte_array, I'd like to assume that Query->rtable should be
> never used
> when root->simple_rte_array is ready.

TBH, now that Lists are really arrays, there's basically no performance
advantage to be gained by fooling with this.  I've considered ripping
out simple_rte_array, but haven't felt that the code churn would be
worth it.

regards, tom lane




Re: storing an explicit nonce

2021-05-25 Thread Bruce Momjian
On Tue, May 25, 2021 at 12:46:45PM -0400, Robert Haas wrote:
> On Thu, Mar 18, 2021 at 2:59 PM Bruce Momjian  wrote:
> > > Ultimately, we need to make sure that LSNs aren't re-used.  There's two
> > > sources of LSNs today: those for relations which are being written into
> > > the WAL and those for relations which are not (UNLOGGED relations,
> > > specifically).  The 'minimal' WAL level introduces complications with
> >
> > Well, the story is a little more complex than that --- we currently have
> > four LSN uses:
> >
> > 1.  real LSNs for WAL-logged relfilenodes
> > 2.  real LSNs for GiST indexes for non-WAL-logged relfilenodes of
> > permanent relations
> > 3.  fake LSNs for GiST indexes for relfilenodes of non-permanent relations
> > 4.  zero LSNs for non-GiST non-permanent relations
> >
> > This patch changes it so #4 gets fake LSNs, and slightly adjusts #2 & #3
> > so the LSNs are always unique.
> 
> Hi!
> 
> This approach has a few disadvantages. For example, right now, we only
> need to WAL log hints for the first write to each page after a
> checkpoint, but in this approach, if the same page is written multiple
> times per checkpoint cycle, we'd need to log hints every time. In some
> workloads that could be quite expensive, especially if we log an FPI
> every time.

Well, if we create a separate nonce counter, we still need to make sure
it doesn't go backwards during a crash, so we have to WAL log it
somehow, perhaps at a certain interval like 1k and advance the counter
by 1k in case of crash recovery, like we do with the oid counter now, I
think.
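
Roughly, I am thinking of something like the following (every name here is
hypothetical; it only mirrors the way the OID counter is logged ahead of
time):

#include "postgres.h"

#define NONCE_PREFETCH 1024

static uint64 NextNonce = 0;    /* next nonce value to hand out */
static uint64 LoggedNonce = 0;  /* highest value already covered by WAL */

/* Placeholder: a real version would emit a WAL record via XLogInsert(). */
static void
LogNonceCounter(uint64 upto)
{
    LoggedNonce = upto;
}

/*
 * Hand out a nonce, WAL-logging the counter every NONCE_PREFETCH values.
 * After a crash, redo restarts the counter at the logged value, which is
 * always ahead of anything already handed out, so no value is reused even
 * if the most recent increments never made it to disk.  A real version
 * would also hold a lock here, as GetNewObjectId() does for OIDs.
 */
static uint64
GetNewNonce(void)
{
    if (NextNonce >= LoggedNonce)
        LogNonceCounter(NextNonce + NONCE_PREFETCH);
    return NextNonce++;
}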

The buffer encryption overhead is 2-4%, and WAL encryption is going to
add to that, so I thought hint bit logging overhead would be minimal
in comparison.

> Also, I think that all sorts of non-permanent relations currently get
> zero LSNs, not just GiST. Every unlogged table and every temporary
> table would need to use fake LSNs. Moreover, for unlogged tables, the
> buffer manager would need changes, because it is otherwise going to
> assume that anything it sees in the pd_lsn field other than a zero is
> a real LSN.

Have you looked at the code, specifically EncryptPage():


https://github.com/postgres/postgres/compare/bmomjian:cfe-11-gist..bmomjian:_cfe-12-rel.patch

+   if (!relation_is_permanent && !is_gist_page_or_similar)
+   PageSetLSN(page, LSNForEncryption(relation_is_permanent));


It assigns an LSN to unlogged pages.  As far as the buffer manager
seeing fake LSNs that already happens for GiST indexes, so I just built
on that --- seemed to work fine.

> So I would like to propose an alternative: store the nonce in the
> page. Now the next question is where to put it. I think that putting
> it into the page header would be far too invasive, so I propose that
> we instead store it at the end of the page, as part of the special
> space. That makes an awful lot of code not really notice that anything
> is different, because it always thought that the usable space on the
> page ended where the special space begins, and it doesn't really care
> where that is exactly. The code that knows about the special space
> might care a little bit, but whatever private data it's storing is
> going to be at the beginning of the special space, and the nonce would
> be stored - in this proposal - at the end of the special space. So it
> turns out that it doesn't really care that much either.

I think the big problem with that is that it adds a new counter, with
new code, and it makes adding encryption offline, like we do for adding
checksums, pretty much impossible since the page might not have space
for a nonce.  It also makes the idea of adding encryption as part of a
pg_upgrade non-link mode also impossible, at least for me.  ;-)

I have to ask why we should consider adding it to the special space,
since my current version seems fine, and has minimal code impact, and
has some advantages over using the special space.  Is it because of the
WAL hint overhead, or for a cleaner API, or something else?

Also, I need help with all the XXX comments I have in my patches before
I can move forward:

https://wiki.postgresql.org/wiki/Transparent_Data_Encryption#Patches

I stopped working on this to get beta out the door, but next week it
would be nice to continue on this.  However, I want to get this patch
into a state where everyone is happy with it, rather than adding more
code with an unclear future.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.





Re: storing an explicit nonce

2021-05-25 Thread Bruce Momjian
On Tue, May 25, 2021 at 02:25:21PM -0400, Robert Haas wrote:
> One question here is whether we're comfortable saying that the nonce
> is entirely constant. I wasn't sure about that. It seems possible to
> me that different encryption algorithms might want nonces of different
> sizes, either now or in the future. I am not a cryptographer, but that
> seemed like a bit of a limiting assumption. So Bharath and I decided
> to make the POC cater to a fully variable-size nonce rather than
> zero-or-some-constant. However, if the consensus is that
> zero-or-some-constant is better, fair enough! The patch can certainly
> be adjusted to cater to work that way.

A 16-byte nonce is sufficient for AES and I doubt we will need anything
stronger than AES256 anytime soon.  Making the nonce variable length
seems like it just adds complexity for little purpose.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.





Re: CALL versus procedures with output-only arguments

2021-05-25 Thread Robert Haas
On Tue, May 25, 2021 at 2:20 PM Tom Lane  wrote:
> Just to enlarge on that point a bit:
>
> regression=# create function foo(int, out int) language sql
> regression-# as 'select $1';
> CREATE FUNCTION
> regression=# create procedure foo(int, out int) language sql
> regression-# as 'select $1';
> CREATE PROCEDURE
>
> IMO this should have failed, but since it doesn't:
>
> regression=# drop routine foo(int, out int);
> DROP ROUTINE
>
> Which object was dropped, and what is the argument for that one
> being the right one?
>
> Experimentation shows that in HEAD, what is dropped is the procedure,
> and indeed the DROP will fail if you try to use it on the function.
> That is a compatibility break, because in previous versions this
> worked:
>
> regression=# create function foo(int, out int) language sql
> as 'select $1';
> CREATE FUNCTION
> regression=# drop routine foo(int, out int);
> DROP ROUTINE
>
> The fact that you now have to be aware of these details to use
> ALTER/DROP ROUTINE seems like a pretty serious loss of user
> friendliness, as well as compatibility.

I'm also concerned about the behavior here. I noticed it when this
commit went in, and it seemed concerning to me then, and it still
does. Nevertheless, I'm not convinced that your proposal is an
improvement. Suppose we have foo(int, out int) and also foo(int).
Then, if I understand correctly, under your proposal, foo(4) will call
the former within plpgsql code, because in that context the OUT
parameters must be included, and the latter from SQL code, because in
that context they must be omitted. I suspect in practice what will
happen is that you'll end up with both interpretations even within the
body of a plpgsql function, because plpgsql functions tend to include
SQL queries where, I presume, the SQL interpretation must apply. It
seems that it will be very difficult for users to know which set of
rules apply in which contexts.

Now, that being said, the status quo is also pretty bad, because we
have one set of rules for functions and another for procedures. I
believe that users will expect those to behave in similar ways, and
will be sad and surprised when they don't.

But on the third hand, Peter is also correct when he says that there's
not much use in implementing standard features with non-standard
semantics. The fact that we've chosen to make OUT parameters do some
random thing that is not what other systems do is, indeed, not great
for migrations. So doubling down on that questionable choice is also
not great. In a green field I think we ought to go the other way and
make OUT parameters as consistent with the standard as we can, and
have that handling be the same for procedures and for functions, but
it seems impossible to imagine making such a large compatibility break
with our own previous releases, however much the spec may dictate it.

I don't see any really great choice here, but in some sense your
proposal seems like the worst of all the options. It does not reverse
the patch's choice to treat functions and procedures differently, so
users will still have to deal with that inconsistency. But in addition
the handling of procedures will itself be inconsistent based on
context.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: CALL versus procedures with output-only arguments

2021-05-25 Thread Tom Lane
Peter Eisentraut  writes:
> On 25.05.21 17:20, Tom Lane wrote:
>> I don't really see how you can argue that the existing behavior is
>> more spec-compliant than what I'm suggesting.  What I read in the spec
>> (SQL:2021 10.4 <routine invocation> SR 9) h) iii) 1)) is
>> 1) If Pi is an output SQL parameter, then XAi shall be a <target specification>.

> I concede that the current implementation is not fully standards 
> compliant in this respect.  Maybe we need to rethink how we can satisfy 
> this better.  For example, in some other implementations, you write CALL 
> p(?), (where ? is the parameter placeholder), so it's sort of an output 
> parameter.  However, changing it so that the entire way the parameters 
> are counted is different seems a much greater departure.

I'd expect to be able to write something like that in contexts where
there's a reasonable way to name an output parameter.  Like, say,
plpgsql.  Or JDBC --- I think they already use a notation like that
for output parameters from functions, and transform it after the fact.
As things work in HEAD, they'll have to have a different special hack
for procedures than they do for functions.  But none of this applies
to bare-SQL CALL.

>> More generally, there are enough deviations from spec in what we do
>> to perform ambiguous-call resolution that it seems rather silly to
>> hang your hat on this particular point.

> I don't know what you mean by this.

Well, let's take an example.  If OUT parameters are part of the
signature, then I'm allowed to do this:

regression=# create procedure p1(in x int, out y int) 
regression-# language sql as 'select $1';
CREATE PROCEDURE
regression=# create procedure p1(in x int, out y float8)
language sql as 'select $1';
CREATE PROCEDURE
regression=# call p1(42, null);
 y  
----
 42
(1 row)

I'm surprised that that worked rather than throwing an ambiguity
error.  I wonder which procedure it called, and where in the spec
you can find chapter and verse saying that that one and not the other
one is right.

It gets even sillier though, because experimentation shows that it
was the int one that was preferred:

regression=# create or replace procedure p1(in x int, out y float8)
language sql as 'select $1+1';
CREATE PROCEDURE
regression=# call p1(42, null);
 y  
----
 42
(1 row)

That seems kind of backwards really, considering that float8 is
further up the numeric hierarchy.  But let's keep going:

regression=# create procedure p1(in x int, out y text)
language sql as 'select $1+2';
CREATE PROCEDURE
regression=# call p1(42, null);
 y  
----
 44
(1 row)

So text is preferred to either int or float8.  I know why that
happened: we have a preference for matching UNKNOWN to string types.
But I challenge you to provide any argument that this behavior is
spec-compliant.

More generally, the point I'm trying to make is that our rules
for resolving an ambiguous function differ in a whole lot of
details from what SQL says.  That ship sailed a couple of
decades ago, so I'm not excited about adopting a fundamentally
bad design in pursuit of trying to make one small detail of
that behavior slightly closer to SQL.

[ thinks a bit ]

A lot of what I'm exercised about here is not the question of
how many parameters we write in CALL, but the choice to redefine
proargtypes (and thereby change what is considered the routine's
signature).  With the infrastructure in the patch I proposed,
it'd be possible to revert the signature changes and still
write dummy output parameters in CALL -- we'd just make CALL
set include_out_parameters=true all the time.  I do not think that
solution is superior to what I did in the patch, but if we can't
have a meeting of the minds on CALL, doing that much would still
be an improvement.

regards, tom lane




Re: storing an explicit nonce

2021-05-25 Thread Bruce Momjian
On Tue, May 25, 2021 at 10:37:32AM -0700, Andres Freund wrote:
> The obvious concerns are issues around binary upgrades for cases that
> already use the special space? Are you planning to address that by not
> having that path? Or by storing the nonce at the "start" of the special
> space (i.e. [normal data][nonce][existing special])?
> 
> Is there an argument for generalizing the nonce approach to replace
> fake LSNs for unlogged relations?
> 
> Why is using pd_special better than finding space for a flag bit in the
> header indicating whether it has a nonce? Using pd_special will burden
> all code using special space, and maybe even some that does not (think
> empty pages now occasionally having a non-zero pd_special), whereas
> implementing it on the page level wouldn't quite have the same concerns.

My code can already identify if the LSN is fake or not --- why can't we
build on that?  Can someone show that logging WAL hint bits causes
unacceptable overhead beyond the encryption overhead?  I don't think we
even know that since we don't know the overhead of encrypting WAL.

One crazy idea would be to not log WAL hints, but rather use an LSN
range that will never be valid for real LSNs, like the high bit being
set.  That special range would need to be WAL-logged, but again, perhaps
every 1k, and increment by 1k on a crash.
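
As a rough illustration of that idea (the names are hypothetical, and the
counter itself would still be logged ahead like the OID counter):

#include "postgres.h"

#include "access/xlogdefs.h"

/*
 * Hypothetical sketch only: reserve LSN values with the high bit set for
 * fake LSNs, so they can never collide with a real WAL position.
 */
#define FAKE_LSN_FLAG   UINT64CONST(0x8000000000000000)

static XLogRecPtr
next_fake_lsn(uint64 counter)
{
    return (XLogRecPtr) (FAKE_LSN_FLAG | counter);
}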

This discussion has cemented what I had already considered --- that
doing a separate nonce will make this feature less usable/upgradable,
and take it beyond my ability or desire to complete.

Ideally, what I would like to do is to resolve my XXX questions in my
patches, get everyone happy with what we have, then let me do the WAL
encryption.  We can then see if logging hint bits is significant
overhead, and if it is, go with a special LSN range for fake LSNs.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.





Re: storing an explicit nonce

2021-05-25 Thread Robert Haas
On Tue, May 25, 2021 at 2:45 PM Bruce Momjian  wrote:
> Well, if we create a separate nonce counter, we still need to make sure
> it doesn't go backwards during a crash, so we have to WAL log it

I think we don't really need a global counter, do we? We could simply
increment the nonce every time we write the page. If we want to avoid
using the same IV for different pages, then 8 bytes of the nonce could
store a value that's different for every page, and the other 8 bytes
could store a counter. Presumably we won't manage to write the same
page more than 2^64 times, since LSNs are limited to be <2^64, and
those are consumed more than 1 byte at a time for every change to any
page anywhere.
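
To illustrate, a rough sketch of that layout (the struct and function names
are hypothetical, not taken from any patch):

#include "postgres.h"

/*
 * Sketch: eight bytes identify the page so no two pages ever share an IV,
 * and eight bytes count writes of that page, bumped on every write-out.
 */
typedef struct PageNonce
{
    uint64      page_ident;     /* unique per page, e.g. derived from the
                                 * relfilenode plus block number */
    uint64      write_counter;  /* incremented each time the page is
                                 * written out */
} PageNonce;

static inline void
page_nonce_advance(PageNonce *nonce)
{
    /* a page would need more than 2^64 writes before this could wrap */
    nonce->write_counter++;
}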

> The buffer encryption overhead is 2-4%, and WAL encryption is going to
> add to that, so I thought hint bit logging overhead would be minimal
> in comparison.

I think it depends. If buffer evictions are rare, then it won't matter
much. But if they are common, then using the LSN as the nonce will add
a lot of overhead.

> Have you looked at the code, specifically EncryptPage():
>
> 
> https://github.com/postgres/postgres/compare/bmomjian:cfe-11-gist..bmomjian:_cfe-12-rel.patch
>
> +   if (!relation_is_permanent && !is_gist_page_or_similar)
> +   PageSetLSN(page, LSNForEncryption(relation_is_permanent));
>
>
> It assigns an LSN to unlogged pages.  As far as the buffer manager
> seeing fake LSNs that already happens for GiST indexes, so I just built
> on that --- seemed to work fine.

I had not, but I don't see why this issue is specific to GiST rather
than common to every kind of unlogged and temporary relation.

> I have to ask why we should consider adding it to the special space,
> since my current version seems fine, and has minimal code impact, and
> has some advantages over using the special space.  Is it because of the
> WAL hint overhead, or for a cleaner API, or something else?

My concern is about the overhead, and also the code complexity. I
think that making sure that the LSN gets changed in all cases may be
fairly tricky.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: CALL versus procedures with output-only arguments

2021-05-25 Thread Tom Lane
Robert Haas  writes:
> I'm also concerned about the behavior here. I noticed it when this
> commit went in, and it seemed concerning to me then, and it still
> does. Nevertheless, I'm not convinced that your proposal is an
> improvement. Suppose we have foo(int, out int) and also foo(int).
> Then, if I understand correctly, under your proposal, foo(4) will call
> the former within plpgsql code, because in that context the OUT
> parameters must be included, and the latter from SQL code, because in
> that context they must be emitted.

No, you misunderstand my proposal.  The thing that I most urgently
want to do is to prevent that situation from ever arising, by not
allowing those two procedures to coexist, just as you can't have
two functions with such signatures.

If procedures are required to have distinct signatures when considering
input parameters only, then a fortiori they are distinct when also
considering output parameters.  So my proposal cannot make a CALL
that includes output parameters ambiguous if it was not before.

> I don't see any really great choice here, but in some sense your
> proposal seems like the worst of all the options. It does not reverse
> the patch's choice to treat functions and procedures differently, so
> users will still have to deal with that inconsistency.

You're definitely confused, because reversing that choice is *exactly*
what I'm on about.  The question of whether SQL-level CALL should act
differently from plpgsql CALL is pretty secondary.

regards, tom lane




Re: pg_rewind fails if there is a read only file.

2021-05-25 Thread Andrew Dunstan


On 5/25/21 10:29 AM, Tom Lane wrote:
> Andrew Dunstan  writes:
>> Perhaps we should add that read-only files can be particularly problematic.
> Given the (legitimate, IMO) example of a read-only SSL key, I'm not
> quite convinced that pg_rewind doesn't need to cope with this.
>
>   


If we do decide to do something, the question arises of what it should do.
If we're to allow for it, I'm wondering if the best thing would be simply
to ignore such a file.
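
To make "ignore" concrete, here is a minimal sketch of what that could look
like when opening a target file (the helper name is hypothetical; pg_rewind's
real open path lives in file_ops.c):

#include "postgres_fe.h"

#include <errno.h>
#include <fcntl.h>

#include "common/file_perm.h"
#include "common/logging.h"

/*
 * Sketch only: warn and skip a target file that cannot be opened for
 * writing (e.g. a read-only SSL key) instead of aborting the whole
 * rewind.  Returns -1 when the file was skipped.
 */
static int
open_target_file_or_skip(const char *path)
{
    int         fd = open(path, O_WRONLY | O_CREAT, pg_file_create_mode);

    if (fd < 0)
    {
        if (errno == EACCES)
        {
            pg_log_warning("skipping read-only file \"%s\": %m", path);
            return -1;
        }
        pg_log_error("could not open file \"%s\": %m", path);
        exit(1);
    }
    return fd;
}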


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: storing an explicit nonce

2021-05-25 Thread Bruce Momjian
On Tue, May 25, 2021 at 03:09:03PM -0400, Robert Haas wrote:
> On Tue, May 25, 2021 at 2:45 PM Bruce Momjian  wrote:
> > Well, if we create a separate nonce counter, we still need to make sure
> > it doesn't go backwards during a crash, so we have to WAL log it
> 
> I think we don't really need a global counter, do we? We could simply
> increment the nonce every time we write the page. If we want to avoid
> using the same IV for different pages, then 8 bytes of the nonce could
> store a value that's different for every page, and the other 8 bytes
> could store a counter. Presumably we won't manage to write the same
> page more than 2^64 times, since LSNs are limited to be <2^64, and
> those are consumed more than 1 byte at a time for every change to any
> page anywhere.

The issue we had here is what do you use as a special value for each
relation?  Where do you store it if it is not computed?  You can use a
global counter for the per-page nonce that doesn't change when the page
is updated, but that would still need to be a global counter.

Also, when you change hint bits, either you don't change the nonce/LSN,
and don't re-encrypt the page (and the hint bit changes are visible), or
you change the nonce and re-encrypt the page, and you are then WAL
logging the page.  I don't see how having a nonce different from the LSN
helps here.

> > The buffer encryption overhead is 2-4%, and WAL encryption is going to
> > add to that, so I thought hint bit logging overhead would be minimal
> > in comparison.
> 
> I think it depends. If buffer evictions are rare, then it won't matter
> much. But if they are common, then using the LSN as the nonce will add
> a lot of overhead.

Well, see above.  A separate nonce somewhere else doesn't help much, as
I see it.

> > Have you looked at the code, specifically EncryptPage():
> >
> > 
> > https://github.com/postgres/postgres/compare/bmomjian:cfe-11-gist..bmomjian:_cfe-12-rel.patch
> >
> > +   if (!relation_is_permanent && !is_gist_page_or_similar)
> > +   PageSetLSN(page, LSNForEncryption(relation_is_permanent));
> >
> >
> > It assigns an LSN to unlogged pages.  As far as the buffer manager
> > seeing fake LSNs that already happens for GiST indexes, so I just built
> > on that --- seemed to work fine.
> 
> I had not, but I don't see why this issue is specific to GiST rather
> than common to every kind of unlogged and temporary relation.
> 
> > I have to ask why we should consider adding it to the special space,
> > since my current version seems fine, and has minimal code impact, and
> > has some advantages over using the special space.  Is it because of the
> > WAL hint overhead, or for a cleaner API, or something else?
> 
> My concern is about the overhead, and also the code complexity. I
> think that making sure that the LSN gets changed in all cases may be
> fairly tricky.

Please look over the patch to see if I missed anything --- for me, it
seemed quite clear, and I am not an expert in that area of the code.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.





Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY

2021-05-25 Thread Alvaro Herrera
On 2021-May-24, Noah Misch wrote:

> prairiedog and wrasse failed a $SUBJECT test after this (commit 8aba932).
> Also, some non-CLOBBER_CACHE_ALWAYS animals failed a test before the fix.
> These IsolationCheck failures match detach-partition-concurrently[^\n]*FAILED:
> 
>   sysname   │  snapshot   │ branch │ bfurl
> ────────────┼─────────────┼────────┼───────

Checking this list, these three failures can be explained by the
detach-partition-concurrently-3 that was just patched.

>  wrasse │ 2021-04-21 10:38:32 │ HEAD   │ 
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=wrasse&dt=2021-04-21%2010%3A38%3A32
>  prairiedog │ 2021-04-25 22:05:48 │ HEAD   │ 
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2021-04-25%2022%3A05%3A48
>  wrasse │ 2021-05-11 03:43:40 │ HEAD   │ 
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=wrasse&dt=2021-05-11%2003%3A43%3A40

Next there's a bunch whose error message is the same that we had seen
earlier in this thread; these animals are all CLOBBER_CACHE_ALWAYS:

 step s1insert: insert into d4_fk values (1);
 +ERROR:  insert or update on table "d4_fk" violates foreign key constraint 
"d4_fk_a_fkey"

>  hyrax  │ 2021-03-27 07:29:34 │ HEAD   │ 
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hyrax&dt=2021-03-27%2007%3A29%3A34
>  trilobite  │ 2021-03-29 18:14:24 │ HEAD   │ 
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=trilobite&dt=2021-03-29%2018%3A14%3A24
>  hyrax  │ 2021-04-01 07:21:10 │ HEAD   │ 
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hyrax&dt=2021-04-01%2007%3A21%3A10
>  avocet │ 2021-04-05 15:45:56 │ HEAD   │ 
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=avocet&dt=2021-04-05%2015%3A45%3A56
>  hyrax  │ 2021-04-06 07:15:16 │ HEAD   │ 
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hyrax&dt=2021-04-06%2007%3A15%3A16
>  hyrax  │ 2021-04-11 07:25:50 │ HEAD   │ 
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hyrax&dt=2021-04-11%2007%3A25%3A50
>  hyrax  │ 2021-04-20 18:25:37 │ HEAD   │ 
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hyrax&dt=2021-04-20%2018%3A25%3A37

This is fine, because the fix commit 8aba932 is dated April 22 and these
failures all predate that.


And finally there's these two:

>  topminnow  │ 2021-03-28 20:37:38 │ HEAD   │ 
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=topminnow&dt=2021-03-28%2020%3A37%3A38
>  dragonet   │ 2021-04-01 19:48:03 │ HEAD   │ 
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dragonet&dt=2021-04-01%2019%3A48%3A03

(animals not CCA) which are exposing the same problem in
detach-partition-concurrently-4 that we just fixed in
detach-partition-concurrently-3, so we should apply the same fix: add a
no-op step right after the cancel to prevent the error report from
changing.  I'll go do that after grabbing some coffee.

Thanks for digging into the reports!

-- 
Álvaro Herrera               39°49'30"S 73°17'W
"Cada quien es cada cual y baja las escaleras como quiere" (JMSerrat)




Re: storing an explicit nonce

2021-05-25 Thread Bruce Momjian
On Tue, May 25, 2021 at 03:20:06PM -0400, Bruce Momjian wrote:
> Also, when you change hint bits, either you don't change the nonce/LSN,
> and don't re-encrypt the page (and the hint bit changes are visible), or
> you change the nonce and reencrypt the page, and you are then WAL
> logging the page.  I don't see how having a nonce different from the LSN
> helps here.

Let me go into more detail here.  The general rule is that you never
encrypt _different_ data with the same key/nonce.  Now, since a hint bit
change changes the data, it should get a new nonce, and since it is a
newly encrypted page (using a new nonce), it should be WAL logged
because a torn page would make the data unreadable.

Now, if we want to consult some security experts and have them tell us
the hint bit visibility is not a problem, we could get by without using a
new nonce for hint bit changes, and in that case it doesn't matter if we
have a separate LSN or custom nonce --- it doesn't get changed for hint
bit changes.

My point is that we have to full-page-write cases where we change the
nonce --- we get a new LSN/nonce for free if we are using the LSN as the
nonce.  What has made this approach much easier is that you basically
tie a change of the nonce to require a change of LSN, since you are WAL
logging it and every nonce change has to be full-page-write WAL logged. 
This makes the LSN-as-nonce less fragile to breakage than a custom
nonce, in my opinion, which may explain why my patch is so small.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.





Re: storing an explicit nonce

2021-05-25 Thread Bruce Momjian
On Tue, May 25, 2021 at 03:34:04PM -0400, Bruce Momjian wrote:
> Let me go into more detail here.  The general rule is that you never
> encrypt _different_ data with the same key/nonce.  Now, since a hint bit
> change changes the data, it should get a new nonce, and since it is a
> newly encrypted page (using a new nonce), it should be WAL logged
> because a torn page would make the data unreadable.
> 
> Now, if we want to consult some security experts and have them tell us
> the hint bit visibility is not a problem, we could get by without using a
> new nonce for hint bit changes, and in that case it doesn't matter if we
> have a separate LSN or custom nonce --- it doesn't get changed for hint
> bit changes.
> 
> My point is that we have to full-page-write cases where we change the
> nonce --- we get a new LSN/nonce for free if we are using the LSN as the
> nonce.  What has made this approach much easier is that you basically
> tie a change of the nonce to require a change of LSN, since you are WAL
> logging it and every nonce change has to be full-page-write WAL logged. 
> This makes the LSN-as-nonce less fragile to breakage than a custom
> nonce, in my opinion, which may explain why my patch is so small.

This issue is covered at the bottom of this patch to the README file:


https://github.com/postgres/postgres/compare/bmomjian:cfe-01-doc..bmomjian:_cfe-02-internaldoc.patch

Hint Bits
- - - - -

For hint bit changes, the LSN normally doesn't change, which is
a problem.  By enabling wal_log_hints, you get full page writes
to the WAL after the first hint bit change of the checkpoint.
This is useful for two reasons.  First, it generates a new LSN,
which is needed for the IV to be secure.  Second, full page images
protect against torn pages, which is an even bigger requirement for
encryption because the new LSN is re-encrypting the entire page,
not just the hint bit changes.  You can safely lose the hint bit
changes, but you need to use the same LSN to decrypt the entire
page, so a torn page with an LSN change cannot be decrypted.
To prevent this, wal_log_hints guarantees that the pre-hint-bit
version (and previous LSN version) of the page is restored.

However, if a hint-bit-modified page is written to the file system
during a checkpoint, and there is a later hint bit change switching
the same page from clean to dirty during the same checkpoint, we
need a new LSN, and wal_log_hints doesn't give us a new LSN here.
The fix for this is to update the page LSN by writing a dummy
WAL record via xloginsert.c::LSNForEncryption() in such cases.

Let me know if it needs more detail.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.





Re: CALL versus procedures with output-only arguments

2021-05-25 Thread Robert Haas
On Tue, May 25, 2021 at 3:10 PM Tom Lane  wrote:
> No, you misunderstand my proposal.  The thing that I most urgently
> want to do is to prevent that situation from ever arising, by not
> allowing those two procedures to coexist, just as you can't have
> two functions with such signatures.
>
> If procedures are required to have distinct signatures when considering
> input parameters only, then a fortiori they are distinct when also
> considering output parameters.  So my proposal cannot make a CALL
> that includes output parameters ambiguous if it was not before.

Oh, OK.

I'm not sure what I think about that yet. It certainly seems to make
things less confusing. But on the other hand, I think that the
standard - or some competing systems - may have cases where they
disambiguate calls based on output arguments only. Granted, if we
prohibit that now, we can always change our minds and allow it later
if we are sure we've got everything figured out, whereas if we don't
prohibit now, backward compatibility will make it hard to prohibit it
later. But on the other hand I don't really fully understand Peter's
thinking here, so I'm a little reluctant to jump to the conclusion
that he's lost the way.

> > I don't see any really great choice here, but in some sense your
> > proposal seems like the worst of all the options. It does not reverse
> > the patch's choice to treat functions and procedures differently, so
> > users will still have to deal with that inconsistency.
>
> You're definitely confused, because reversing that choice is *exactly*
> what I'm on about.  The question of whether SQL-level CALL should act
> differently from plpgsql CALL is pretty secondary.

I understood the reverse from the first post on the thread, so perhaps
it is more that your thinking has developed than that I am confused.

However, it's possible that I only think that because I'm confused.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Add ZSON extension to /contrib/

2021-05-25 Thread Andrew Dunstan


On 5/25/21 6:55 AM, Aleksander Alekseev wrote:
> Hi hackers,
>
> Back in 2016 while being at PostgresPro I developed the ZSON extension
> [1]. The extension introduces the new ZSON type, which is 100%
> compatible with JSONB but uses a shared dictionary of strings most
> frequently used in given JSONB documents for compression. These
> strings are replaced with integer IDs. Afterward, PGLZ (and now LZ4)
> applies if the document is large enough by common PostgreSQL logic.
> Under certain conditions (many large documents), this saves disk
> space, memory and increases the overall performance. More details can
> be found in README on GitHub.
>
> The extension was accepted warmly and instantaneously I got several
> requests to submit it to /contrib/ so people using Amazon RDS and
> similar services could enjoy it too. Back then I was not sure if the
> extension is mature enough and if it lacks any additional features
> required to solve the real-world problems of the users. Time showed,
> however, that people are happy with the extension as it is. There were
> several minor issues discovered, but they were fixed back in 2017. The
> extension never experienced any compatibility problems with the next
> major release of PostgreSQL.
>
> So my question is if the community may consider adding ZSON to
> /contrib/. If this is the case I will add this thread to the nearest
> CF and submit a corresponding patch.
>
> [1]: https://github.com/postgrespro/zson
> 
>
We (2ndQuadrant, now part of EDB) made some enhancements to Zson a few years 
ago, and I have permission to contribute those if this proposal is adopted. 
From the readme:

1. There is an option to make zson_learn only process object keys,
rather than field values.

```
select zson_learn('{{table1,col1}}',true);
```

2. Strings with an octet-length less than 3 are not processed.
Since an encoded string takes 2 bytes, plus another byte for the length
of the following skipped bytes, encoding values shorter than 3 bytes
would be a net loss.

3. There is a new function to create a dictionary directly from an
array of text, rather than using the learning code:

```
select zson_create_dictionary(array['word1','word2']::text[]);
```

4. There is a function to augment the current dictionary from an array of text:

```
select zson_extend_dictionary(array['value1','value2','value3']::text[]);
```

This is particularly useful for adding common field prefixes or values. A good
example of field prefixes is URL values where the first part of the URL is
fairly constrained but the last part is not.


cheers

andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Add ZSON extension to /contrib/

2021-05-25 Thread Tom Lane
Magnus Hagander  writes:
> On Tue, May 25, 2021 at 12:55 PM Aleksander Alekseev
>  wrote:
>> Back in 2016 while being at PostgresPro I developed the ZSON extension [1]. 
>> The extension introduces the new ZSON type, which is 100% compatible with 
>> JSONB but uses a shared dictionary of strings most frequently used in given 
>> JSONB documents for compression.

> If the extension is mature enough, why make it an extension in
> contrib, and not instead either enhance the existing jsonb type with
> it or make it a built-in type?

IMO we have too d*mn many JSON types already.  If we can find a way
to shoehorn this optimization into JSONB, that'd be great.  Otherwise
I do not think it's worth the added user confusion.

Also, even if ZSON was "100% compatible with JSONB" back in 2016,
a whole lot of features have been added since then.  Having to
duplicate all that code again for a different data type is not
something I want to see us doing.  So that's an independent reason
for wanting to hide this under the existing type not make a new one.

regards, tom lane




Re: Add ZSON extension to /contrib/

2021-05-25 Thread Matthias van de Meent
On Tue, 25 May 2021 at 13:32, Magnus Hagander  wrote:
>
> On Tue, May 25, 2021 at 12:55 PM Aleksander Alekseev
>  wrote:
> >
> > Hi hackers,
> >
> > Back in 2016 while being at PostgresPro I developed the ZSON extension [1]. 
> > The extension introduces the new ZSON type, which is 100% compatible with 
> > JSONB but uses a shared dictionary of strings most frequently used in given 
> > JSONB documents for compression. These strings are replaced with integer 
> > IDs. Afterward, PGLZ (and now LZ4) applies if the document is large enough 
> > by common PostgreSQL logic. Under certain conditions (many large 
> > documents), this saves disk space, memory and increases the overall 
> > performance. More details can be found in README on GitHub.
> >
> > The extension was accepted warmly and instantaneously I got several 
> > requests to submit it to /contrib/ so people using Amazon RDS and similar 
> > services could enjoy it too.

Do note that e.g. postgis is not in contrib, but is available in e.g. RDS.

> > Back then I was not sure if the extension is mature enough and if it lacks 
> > any additional features required to solve the real-world problems of the 
> > users. Time showed, however, that people are happy with the extension as it 
> > is. There were several minor issues discovered, but they were fixed back in 
> > 2017. The extension never experienced any compatibility problems with the 
> > next major release of PostgreSQL.
> >
> > So my question is if the community may consider adding ZSON to /contrib/. 
> > If this is the case I will add this thread to the nearest CF and submit a 
> > corresponding patch.

I like the idea of the ZSON type, but I'm somewhat disappointed by its
current limitations:

- There is only one active shared dictionary (as a user I would want
distinct dictionaries for each use case, similar to ENUM: each ENUM
type has their own limit of 2**31 (?) values)
- There is no provided method to manually specify the dictionary (only
"zson_learn", which constructs a new dictionary)
- You cannot add to the dictionary (equiv. to ALTER TYPE enum_type ADD
VALUE), you must create a new one.

Apart from that, I noticed the following more technical points, for if
you submit it as-is as a patch:

- Each dictionary uses a lot of memory, regardless of the number of
actual stored keys. For 32-bit systems the base usage of a dictionary
without entries ((sizeof(Word) + sizeof(uint16)) * 2**16) would be
almost 1MB, and for 64-bit it would be 1.7MB. That is significantly
more than I'd want to install.
- You call gettimeofday() in both dict_get and in get_current_dict_id.
These functions can be called in short and tight loops (for small ZSON
fields), in which case it would add significant overhead through the
implied syscalls.
- The compression method you've chosen seems to extract most common
strings from the JSONB table, and then use that as a pre-built
dictionary for doing some dictionary encoding on the on-disk format of
the jsonb structure. Although I fully understand that this makes the
system quite easy to reason about, it does mean that you're deTOASTing
the full ZSON field, and that the stored bytestring will not be
structured / doesn't work well with current debuggers.

> If the extension is mature enough, why make it an extension in
> contrib, and not instead either enhance the existing jsonb type with
> it or make it a built-in type?

I don't think that this datatype (that supplies a basic but effective
compression algorithm over JSONB) is fit for core as-is.

I have also thought about building a similar type, but one that would
be more like ENUM: An extension on the JSONB datatype, which has some
list of common 'well-known' values that will be substituted, and to
which later more substitutable values can be added (e.g. CREATE TYPE
... AS JSONB_DICTIONARY  ('"commonly_used_key"',
'"very_long_string_that_appears_often"', '[{"structure": "that",
"appears": "often"}]') or something similar). That would leave JSONB
just as a JSONB_DICTIONARY type without any substitutable values.

These specialized JSONB types could then be used as a specification
for table columns, custom types, et cetera. Some of the reasons I've
not yet built such type is me not being familiar with the jsonb- and
enum-code (which I suspect to be critical for an efficient
implementation of such type), although whilst researching I've noticed
that it is possible to use most of the JSONB infrastructure / read
older jsonb values, as there are still some JEntry type masks
available which could flag such substitutions.

With regards,

Matthias van de Meent




Re: CALL versus procedures with output-only arguments

2021-05-25 Thread Tom Lane
Robert Haas  writes:
> On Tue, May 25, 2021 at 3:10 PM Tom Lane  wrote:
>> You're definitely confused, because reversing that choice is *exactly*
>> what I'm on about.  The question of whether SQL-level CALL should act
>> differently from plpgsql CALL is pretty secondary.

> I understood the reverse from the first post on the thread, so perhaps
> it is more that your thinking has developed than that I am confused.

Yeah, the odd behavior of CALL is where I started from, but now I think
the main problem is with the signature (ie, allowing procedures with
signatures that differ only in OUT parameter positions).  If we got
rid of that choice then it'd be possible to document that you should
only ever write NULL for OUT-parameter positions, because the type
of such an argument would never be significant for disambiguation.

We could consider going further and actually enforcing use of NULL,
or inventing some other syntactic placeholder such as the '?' that
Peter was speculating about.  But I'm not sure that that adds much.

Relevant to this is that my proposed patch gets rid of the existing
behavior that such arguments actually get evaluated.  That would
need to be documented, unless we go with the placeholder approach.
But I've not spent time on the documentation yet.

regards, tom lane




Re: Add ZSON extension to /contrib/

2021-05-25 Thread Andrew Dunstan


On 5/25/21 4:10 PM, Tom Lane wrote:
> Magnus Hagander  writes:
>> On Tue, May 25, 2021 at 12:55 PM Aleksander Alekseev
>>  wrote:
>>> Back in 2016 while being at PostgresPro I developed the ZSON extension [1]. 
>>> The extension introduces the new ZSON type, which is 100% compatible with 
>>> JSONB but uses a shared dictionary of strings most frequently used in given 
>>> JSONB documents for compression.
>> If the extension is mature enough, why make it an extension in
>> contrib, and not instead either enhance the existing jsonb type with
>> it or make it a built-in type?
> IMO we have too d*mn many JSON types already.  If we can find a way
> to shoehorn this optimization into JSONB, that'd be great.  Otherwise
> I do not think it's worth the added user confusion.
>
> Also, even if ZSON was "100% compatible with JSONB" back in 2016,
> a whole lot of features have been added since then.  Having to
> duplicate all that code again for a different data type is not
> something I want to see us doing.  So that's an independent reason
> for wanting to hide this under the existing type not make a new one.



I take your point. However, there isn't really any duplication. It's
handled by this:


CREATE FUNCTION jsonb_to_zson(jsonb)
    RETURNS zson
    AS 'MODULE_PATHNAME'
    LANGUAGE C STRICT IMMUTABLE;

CREATE FUNCTION zson_to_jsonb(zson)
    RETURNS jsonb
    AS 'MODULE_PATHNAME'
    LANGUAGE C STRICT IMMUTABLE;

CREATE CAST (jsonb AS zson) WITH FUNCTION jsonb_to_zson(jsonb) AS ASSIGNMENT;
CREATE CAST (zson AS jsonb) WITH FUNCTION zson_to_jsonb(zson) AS IMPLICIT;


cheers


andrew


-- 

Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: storing an explicit nonce

2021-05-25 Thread Stephen Frost
Greetings,

On Tue, May 25, 2021 at 14:56 Bruce Momjian  wrote:

> On Tue, May 25, 2021 at 02:25:21PM -0400, Robert Haas wrote:
> > One question here is whether we're comfortable saying that the nonce
> > is entirely constant. I wasn't sure about that. It seems possible to
> > me that different encryption algorithms might want nonces of different
> > sizes, either now or in the future. I am not a cryptographer, but that
> > seemed like a bit of a limiting assumption. So Bharath and I decided
> > to make the POC cater to a fully variable-size nonce rather than
> > zero-or-some-constant. However, if the consensus is that
> > zero-or-some-constant is better, fair enough! The patch can certainly
> > be adjusted to cater to work that way.
>
> A 16-byte nonce is sufficient for AES and I doubt we will need anything
> stronger than AES256 anytime soon.  Making the nonce variable length
> seems it is just adding complexity for little purpose.


I’d like to review this more and make sure using the special space is
possible, but if it is, then it opens up a huge new possibility: we could
use it for both the nonce AND an appropriately sized tag, giving us
integrity along with encryption, which would be a very significant
additional feature.  I’d considered using a fork instead, but having it on
the page would be far better.

I’ll also note that we could possibly even find an alternative use for the
space used for checksums, or leave them as they are today, though at that
point they’d be redundant with the tag.

Lastly, if the special space is actually able to be variable in size and we
could, say, store a flag in pg_class which tells us what’s in the special
space, then we could possibly give users the option of including the tag on
each page, or the choice of the size of tag, or possibly for other
interesting things in the future outside of encryption and data integrity.

Overall, I’m quite interested in the idea of making the special space able
to be variable. I do accept that will make it so it’s not possible to do
things like have physical replication between an unencrypted cluster and an
encrypted one, but the advantages seem worthwhile and users would still be
able to leverage logical replication to perform such a migration with
relatively little downtime.

Thanks!

Stephen



Re: Add ZSON extension to /contrib/

2021-05-25 Thread Tom Lane
Andrew Dunstan  writes:
> On 5/25/21 4:10 PM, Tom Lane wrote:
>> Also, even if ZSON was "100% compatible with JSONB" back in 2016,
>> a whole lot of features have been added since then.  Having to
>> duplicate all that code again for a different data type is not
>> something I want to see us doing.  So that's an independent reason
>> for wanting to hide this under the existing type not make a new one.

> I take your point. However, there isn't really any duplication. It's
> handled by [ creating a pair of casts ]

If that were an adequate solution then nobody would be unhappy about
json vs jsonb.  I don't think it really is satisfactory:

* does nothing for user confusion (except maybe make it worse)

* not terribly efficient

* doesn't cover all cases, notably indexes.

regards, tom lane




Re: storing an explicit nonce

2021-05-25 Thread Stephen Frost
Greetings,

On Tue, May 25, 2021 at 15:09 Robert Haas  wrote:

> On Tue, May 25, 2021 at 2:45 PM Bruce Momjian  wrote:
> > Well, if we create a separate nonce counter, we still need to make sure
> > it doesn't go backwards during a crash, so we have to WAL log it
>
> I think we don't really need a global counter, do we? We could simply
> increment the nonce every time we write the page. If we want to avoid
> using the same IV for different pages, then 8 bytes of the nonce could
> store a value that's different for every page, and the other 8 bytes
> could store a counter. Presumably we won't manage to write the same
> page more than 2^64 times, since LSNs are limited to be <2^64, and
> those are consumed more than 1 byte at a time for every change to any
> page anywhere.


The nonce does need to be absolutely unique for a given encryption key and
therefore needs to be global in some form.

Thanks!

Stephen


Re: Add ZSON extension to /contrib/

2021-05-25 Thread Tom Lane
Matthias van de Meent  writes:
> I like the idea of the ZSON type, but I'm somewhat disappointed by its
> current limitations:

I've not read the code, so maybe this thought is completely off-point,
but I wonder if anything could be learned from PostGIS.  AIUI they
have developed the infrastructure needed to have auxiliary info
(particularly, spatial reference data) attached to a geometry column,
without duplicating it in every value of the column.  Seems like that
is a close analog of what's needed here.

regards, tom lane




Re: storing an explicit nonce

2021-05-25 Thread Andres Freund
Hi,

On 2021-05-25 15:34:04 -0400, Bruce Momjian wrote:
> My point is that we have to full-page-write cases where we change the
> nonce --- we get a new LSN/nonce for free if we are using the LSN as the
> nonce.  What has made this approach much easier is that you basically
> tie a change of the nonce to require a change of LSN, since you are WAL
> logging it and every nonce change has to be full-page-write WAL logged.
> This makes the LSN-as-nonce less fragile to breakage than a custom
> nonce, in my opinion, which may explain why my patch is so small.

This disregards that we need to be able to increment nonces on standbys
/ during crash recovery.

It may look like that's not needed, with a (wrong!) argument like: The
only writes come from crash recovery, which always are associated with a
WAL record, guaranteeing nonce increases. Hint bits are not an issue
because they don't mark the buffer dirty.

But unfortunately that analysis is wrong. Consider the following
sequence:

1) replay record LSN X affecting page Y (FPI replay)
2) write out Y, encrypt Y using X as nonce
3) crash
4) replay record LSN X affecting page Y (FPI replay)
5) hint bit update to Y, resulting in Y'
6) write out Y', encrypt Y' using X as nonce

While 5) did not mark the page as dirty, it still modified the page
contents. Which means that we'd encrypt different content with the same
nonce - which is not allowed.

I'm pretty sure that there's several other ways to end up with page
contents that differ, despite the LSN not changing.

Greetings,

Andres Freund




Re: storing an explicit nonce

2021-05-25 Thread Bruce Momjian
On Tue, May 25, 2021 at 01:54:21PM -0700, Andres Freund wrote:
> Hi,
> 
> On 2021-05-25 15:34:04 -0400, Bruce Momjian wrote:
> > My point is that we have to full-page-write cases where we change the
> > nonce --- we get a new LSN/nonce for free if we are using the LSN as the
> > nonce.  What has made this approach much easier is that you basically
> > tie a change of the nonce to require a change of LSN, since you are WAL
> > logging it and every nonce change has to be full-page-write WAL logged.
> > This makes the LSN-as-nonce less fragile to breakage than a custom
> > nonce, in my opinion, which may explain why my patch is so small.
> 
> This disregards that we need to be able to increment nonces on standbys
> / during crash recovery.
> 
> It may look like that's not needed, with a (wrong!) argument like: The
> only writes come from crash recovery, which are always associated with a
> WAL record, guaranteeing nonce increases. Hint bits are not an issue
> because they don't mark the buffer dirty.
> 
> But unfortunately that analysis is wrong. Consider the following
> sequence:
> 
> 1) replay record LSN X affecting page Y (FPI replay)
> 2) write out Y, encrypt Y using X as nonce
> 3) crash
> 4) replay record LSN X affecting page Y (FPI replay)
> 5) hint bit update to Y, resulting in Y'
> 6) write out Y', encrypt Y' using X as nonce
> 
> While 5) did not mark the page as dirty, it still modified the page
> contents. Which means that we'd encrypt different content with the same
> nonce - which is not allowed.
> 
> I'm pretty sure that there's several other ways to end up with page
> contents that differ, despite the LSN not changing.

Yes, I can see that happening.  I think occasional leakage of hint bit
changes is acceptable.  We might decide they are all acceptable.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.





Re: storing an explicit nonce

2021-05-25 Thread Stephen Frost
Greetings,

* Bruce Momjian (br...@momjian.us) wrote:
> On Tue, May 25, 2021 at 03:20:06PM -0400, Bruce Momjian wrote:
> > Also, when you change hint bits, either you don't change the nonce/LSN,
> > and don't re-encrypt the page (and the hint bit changes are visible), or
> > you change the nonce and reencrypt the page, and you are then WAL
> > logging the page.  I don't see how having a nonce different from the LSN
> > helps here.
> 
> Let me go into more detail here.  The general rule is that you never
> encrypt _different_ data with the same key/nonce.  Now, since a hint bit
> change changes the data, it should get a new nonce, and since it is a
> newly encrypted page (using a new nonce), it should be WAL logged
> because a torn page would make the data unreadable.

Right.

> Now, if we want to consult some security experts and have them tell us
> the hint bit visibility is not a problem, we could get by without using a
> new nonce for hint bit changes, and in that case it doesn't matter if we
> have a separate LSN or custom nonce --- it doesn't get changed for hint
> bit changes.

I do think it's reasonable to consider having hint bits not included in
the encrypted part of the page, thereby removing the need to produce a
new nonce for each hint bit change.  Naturally, there's always an
increased risk when any data in the system isn't encrypted, but given
the other parts of the system that aren't being encrypted as part of
this effort, it hardly seems like a significant increase in overall risk.
I don't believe that any of the auditors and security teams I've
discussed TDE with would have issue with hint bits not being encrypted-
the principal concern has always been the primary data.
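As a purely hypothetical sketch of what that could look like for heap
pages (not from any posted patch; the function name and HINT_BITS macro
are made up, index hint bits such as LP_DEAD are ignored, and details
like the freeze status sharing those infomask bits are glossed over),
the tuple hint bits could simply be cleared on a scratch copy of the page
before it is encrypted, so that a hint-bit-only change never alters the
encrypted image and never needs a new nonce:

#include "postgres.h"

#include "access/htup_details.h"
#include "storage/bufpage.h"
#include "storage/itemid.h"
#include "storage/off.h"

#define HINT_BITS   (HEAP_XMIN_COMMITTED | HEAP_XMIN_INVALID | \
                     HEAP_XMAX_COMMITTED | HEAP_XMAX_INVALID)

/* run on a scratch copy of a heap page, just before encrypting it */
static void
clear_hint_bits_for_encryption(Page page)
{
    OffsetNumber off;
    OffsetNumber maxoff = PageGetMaxOffsetNumber(page);

    for (off = FirstOffsetNumber; off <= maxoff; off++)
    {
        ItemId      itemid = PageGetItemId(page, off);
        HeapTupleHeader htup;

        if (!ItemIdIsNormal(itemid))
            continue;

        htup = (HeapTupleHeader) PageGetItem(page, itemid);
        htup->t_infomask &= ~HINT_BITS;
    }
}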

Naturally, the more we are able to encrypt and the more data integrity
validation we can provide, the more places PG may be able to be used in,
which argues for having some way of making these choices be options that
a user could decide at initdb time, or at least contemplating a road map
toward offering users the option to have other parts of the system be
encrypted and ideally covered by data integrity checks.  That said, I
don't think we necessarily have to solve everything right now in that
regard- just having TDE in some form will open up quite a few new
possibilities for v15, even if it doesn't include data integrity
validation beyond our existing checksums and doesn't encrypt hint bits.

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: storing an explicit nonce

2021-05-25 Thread Bruce Momjian
On Tue, May 25, 2021 at 04:29:08PM -0400, Stephen Frost wrote:
> Greetings,
> 
> On Tue, May 25, 2021 at 14:56 Bruce Momjian  wrote:
> 
> On Tue, May 25, 2021 at 02:25:21PM -0400, Robert Haas wrote:
> > One question here is whether we're comfortable saying that the nonce
> > is entirely constant. I wasn't sure about that. It seems possible to
> > me that different encryption algorithms might want nonces of different
> > sizes, either now or in the future. I am not a cryptographer, but that
> > seemed like a bit of a limiting assumption. So Bharath and I decided
> > to make the POC cater to a fully variable-size nonce rather than
> > zero-or-some-constant. However, if the consensus is that
> > zero-or-some-constant is better, fair enough! The patch can certainly
> > be adjusted to cater to work that way.
> 
> A 16-byte nonce is sufficient for AES and I doubt we will need anything
> stronger than AES256 anytime soon.  Making the nonce variable length
> seems to just add complexity for little purpose.
> 
> 
> I’d like to review this more and make sure using the special space is possible
> but if it is then it opens up a huge new possibility that we could use it for
> both the nonce AND an appropriately sized tag, giving us integrity along with
> encryption which would be a very significant additional feature.  I’d
> considered using a fork instead but having it on the page would be far better.
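For reference, assuming the special-space route does work out, that
layout would amount to something like the following minimal sketch (the
struct and function names here are hypothetical, not from the posted POC;
PageInit() and PageGetSpecialPointer() are the existing bufpage
routines):

#include "postgres.h"

#include "storage/bufpage.h"

typedef struct PageEncryptionSpecial
{
    uint8       iv[16];         /* per-page nonce */
    uint8       tag[16];        /* e.g. AES-GCM authentication tag */
} PageEncryptionSpecial;

/* reserve the space when the page is initialized */
static void
init_encrypted_page(Page page)
{
    PageInit(page, BLCKSZ, MAXALIGN(sizeof(PageEncryptionSpecial)));
}

/* locate the per-page encryption metadata later on */
static PageEncryptionSpecial *
page_encryption_special(Page page)
{
    return (PageEncryptionSpecial *) PageGetSpecialPointer(page);
}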

We already discussed that there are too many other places that are not
encrypted/integrity-checked and that could be used to break system
integrity, e.g., changes to clog.  Do you disagree?

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.





Re: Add ZSON extension to /contrib/

2021-05-25 Thread Andrew Dunstan


On 5/25/21 4:31 PM, Tom Lane wrote:
> Andrew Dunstan  writes:
>> On 5/25/21 4:10 PM, Tom Lane wrote:
>>> Also, even if ZSON was "100% compatible with JSONB" back in 2016,
>>> a whole lot of features have been added since then.  Having to
>>> duplicate all that code again for a different data type is not
>>> something I want to see us doing.  So that's an independent reason
>>> for wanting to hide this under the existing type not make a new one.
>> I take your point. However, there isn't really any duplication. It's
>> handled by [ creating a pair of casts ]
> If that were an adequate solution then nobody would be unhappy about
> json vs jsonb.  I don't think it really is satisfactory:
>
> * does nothing for user confusion (except maybe make it worse)
>
> * not terribly efficient
>
> * doesn't cover all cases, notably indexes.
>
>   


Quite so. To some extent it's a toy. But at least one of our customers
has found it useful, and judging by Aleksander's email they aren't
alone. Your ideas downthread are probably a useful pointer of how we
might fruitfully proceed.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: storing an explicit nonce

2021-05-25 Thread Bruce Momjian
On Tue, May 25, 2021 at 05:04:50PM -0400, Stephen Frost wrote:
> > Now, if we want to consult some security experts and have them tell us
> > the hint bit visibility is not a problem, we could get by without using a
> > new nonce for hint bit changes, and in that case it doesn't matter if we
> > have a separate LSN or custom nonce --- it doesn't get changed for hint
> > bit changes.
> 
> I do think it's reasonable to consider having hint bits not included in
> the encrypted part of the page, thereby removing the need to produce a
> new nonce for each hint bit change.  Naturally, there's always an
> increased risk when any data in the system isn't encrypted, but given
> the other parts of the system that aren't being encrypted as part of
> this effort, it hardly seems like a significant increase in overall risk.
> I don't believe that any of the auditors and security teams I've
> discussed TDE with would have issue with hint bits not being encrypted-
> the principal concern has always been the primary data.

OK, this is good to know.  I know the never-reuse rule, so it is helpful
to hear it can be relaxed for certain data without causing problems in
other places.  Should I modify my patch to do this?

FYI, technically, the hint bit is still encrypted, but it could _flip_ in
the encrypted file if changed, which is why we say it is visible.  If we
used a chained block cipher mode (e.g. CBC) instead of a streaming one
(CTR), this might not work, because the ciphertext blocks are chained
together, so a single changed bit would not show up as a single flipped
bit on disk.
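To illustrate the difference, here is a stand-alone OpenSSL test program
(not PostgreSQL code; the key, IV and names are arbitrary, build with
-lcrypto) that encrypts two 32-byte buffers differing in a single bit,
once with AES-128-CTR and once with AES-128-CBC, and counts how many
ciphertext bytes change:

#include <stdio.h>
#include <string.h>
#include <openssl/evp.h>

static void
encrypt_buf(const EVP_CIPHER *cipher, const unsigned char *in,
            unsigned char *out)
{
    static const unsigned char key[] = "0123456789abcdef";
    static const unsigned char iv[] = "fedcba9876543210";
    EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
    int         len;

    EVP_EncryptInit_ex(ctx, cipher, NULL, key, iv);
    EVP_CIPHER_CTX_set_padding(ctx, 0);     /* input is block-aligned */
    EVP_EncryptUpdate(ctx, out, &len, in, 32);
    EVP_EncryptFinal_ex(ctx, out + len, &len);
    EVP_CIPHER_CTX_free(ctx);
}

int
main(void)
{
    unsigned char p1[32], p2[32], c1[48], c2[48];
    int         diff_ctr = 0, diff_cbc = 0;

    memset(p1, 'x', sizeof(p1));
    memcpy(p2, p1, sizeof(p2));
    p2[0] ^= 0x01;              /* flip one "hint" bit in the first byte */

    encrypt_buf(EVP_aes_128_ctr(), p1, c1);
    encrypt_buf(EVP_aes_128_ctr(), p2, c2);
    for (int i = 0; i < 32; i++)
        diff_ctr += (c1[i] != c2[i]);

    encrypt_buf(EVP_aes_128_cbc(), p1, c1);
    encrypt_buf(EVP_aes_128_cbc(), p2, c2);
    for (int i = 0; i < 32; i++)
        diff_cbc += (c1[i] != c2[i]);

    /* CTR: exactly 1 byte differs; CBC: essentially all 32 bytes differ */
    printf("CTR: %d byte(s) differ, CBC: %d byte(s) differ\n",
           diff_ctr, diff_cbc);
    return 0;
}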

> Naturally, the more we are able to encrypt and the more data integrity
> validation we can provide, the more places PG may be able to be used in,
> which argues for having some way of making these choices be options that
> a user could decide at initdb time, or at least contemplating a road map
> toward offering users the option to have other parts of the system be
> encrypted and ideally covered by data integrity checks.  That said, I
> don't think we necessarily have to solve everything right now in that
> regard- just having TDE in some form will open up quite a few new
> possibilities for v15, even if it doesn't include data integrity
> validation beyond our existing checksums and doesn't encrypt hint bits.

I am thinking people who need that should still use full filesystem
encryption.  I am concerned that if we add too many
restrictions/additions to this feature, it will not be very useful.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.





Re: storing an explicit nonce

2021-05-25 Thread Stephen Frost
Greetings,

* Bruce Momjian (br...@momjian.us) wrote:
> On Tue, May 25, 2021 at 01:54:21PM -0700, Andres Freund wrote:
> > On 2021-05-25 15:34:04 -0400, Bruce Momjian wrote:
> > > My point is that we have to full-page-write in cases where we change the
> > > nonce --- we get a new LSN/nonce for free if we are using the LSN as the
> > > nonce.  What has made this approach much easier is that you basically
> > > tie a change of the nonce to require a change of LSN, since you are WAL
> > > logging it and every nonce change has to be full-page-write WAL logged.
> > > This makes the LSN-as-nonce less fragile to breakage than a custom
> > > nonce, in my opinion, which may explain why my patch is so small.
> > 
> > This disregards that we need to be able to increment nonces on standbys
> > / during crash recovery.
> > 
> > It may look like that's not needed, with a (wrong!) argument like: The
> > only writes come from crash recovery, which are always associated with a
> > WAL record, guaranteeing nonce increases. Hint bits are not an issue
> > because they don't mark the buffer dirty.
> > 
> > But unfortunately that analysis is wrong. Consider the following
> > sequence:
> > 
> > 1) replay record LSN X affecting page Y (FPI replay)
> > 2) write out Y, encrypt Y using X as nonce
> > 3) crash
> > 4) replay record LSN X affecting page Y (FPI replay)
> > 5) hint bit update to Y, resulting in Y'
> > 6) write out Y', encrypt Y' using X as nonce
> > 
> > While 5) did not mark the page as dirty, it still modified the page
> > contents. Which means that we'd encrypt different content with the same
> > nonce - which is not allowed.
> > 
> > I'm pretty sure that there's several other ways to end up with page
> > contents that differ, despite the LSN not changing.
> 
> Yes, I can see that happening.  I think occasional leakage of hint bit
> changes is acceptable.  We might decide they are all acceptable.

I don't think that I agree with the idea that this would ultimately only
leak the hint bits- I'm fairly sure that this would make it relatively
trivial for an attacker to be able to deduce the contents of the entire
8k page.  I don't know that we should be willing to accept that as a
part of regular operation (which we generally view crashes as being).  I
had thought there was something in place to address this though.  If
not, it does seem like there should be.

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: storing an explicit nonce

2021-05-25 Thread Bruce Momjian
On Tue, May 25, 2021 at 05:14:24PM -0400, Stephen Frost wrote:
> * Bruce Momjian (br...@momjian.us) wrote:
> > Yes, I can see that happening.  I think occasional leakage of hint bit
> > changes is acceptable.  We might decide they are all acceptable.
> 
> I don't think that I agree with the idea that this would ultimately only
> leak the hint bits- I'm fairly sure that this would make it relatively
> trivial for an attacker to be able to deduce the contents of the entire
> 8k page.  I don't know that we should be willing to accept that as a
> part of regular operation (which we generally view crashes as being).  I
> had thought there was something in place to address this though.  If
> not, it does seem like there should be.

Uh, can you please explain more?  Would the hint bits leak?  In another
email you said hint bit leaking was OK.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.




