Re: Asynchronous Append on postgres_fdw nodes.

2021-01-02 Thread Etsuro Fujita
On Sun, Dec 20, 2020 at 5:15 PM Etsuro Fujita  wrote:
> On Thu, Nov 26, 2020 at 10:28 AM movead...@highgo.ca
>  wrote:
> > Issue one:
> > Get an Assert error at 'Assert(bms_is_member(i, node->as_needrequest));' in
> > the ExecAppendAsyncRequest() function when I use more than two foreign tables
> > on different foreign servers.
> >
> > I researched the code and made a change after which the Assert problem disappeared.
>
> Could you show a test case causing the assertion failure?

I happened to reproduce the same failure in my environment.

I think your change would be correct, but I changed the patch so that
it doesn’t need as_lastasyncplan anymore [1].  The new version of the
patch works well for my case.  So, could you test your case with it?

Best regards,
Etsuro Fujita

[1] 
https://www.postgresql.org/message-id/CAPmGK17L0j6otssa53ZvjnCsjguJHZXaqPL2HU_LDoZ4ATZjEw%40mail.gmail.com




Re: Proposed patch for key management

2021-01-02 Thread Fabien COELHO


Hello Stephen,


I'm unsure about what the "common use case" is. ISTM that having the
postgres process hold the master key is a "no" for me in any use
case with an actual security concern: as the database must be remotely
accessible, a reasonable security assumption is that a pg account could be
compromised, and the "master key" (if any; that is just one particular
cryptographic design) should not be accessible in that case. The first
barrier would be the pg admin account. With an external approach, the
key can be held by another id, so the host must be compromised as well, and
it could even be managed remotely on another host (I agree that this latter
option would be fragile because of the implied network connection, but it
would be a tradeoff depending on the security concerns, and it pretty much
corresponds to the Kerberos model).


No, this isn't anything like the Kerberos model and there's no case
where the PG account won't have access to the DEK (data encryption key)
during normal operation (except with the possibility of off-loading to a
hardware crypto device which, while interesting, is definitely something
that's of very limited utility and which could be added on as a
capability later anyway).  This is also well understood by those who are
interested in this capability and it's perfectly acceptable that PG will
have, in memory, the DEK.


I'm sorry that I'm not being very clear. My complaint is about having a KEK
"master key" in pg memory. I think I'm fine at this point with having DEKs
available on the host to encrypt/decrypt files. What I meant about Kerberos is
that the setup I was describing makes the database dependent on a remote host,
which is somewhat what Kerberos does.



The keys which are stored on disk with the proposed patch are encrypted
using a KEK which is external to PG- that's all part of the existing
patch and design.


Yep. My point is that the patch hardwires a cryptographic design with many
assumptions, whereas I think it should define an API compatible with a
larger class of designs, and then possibly implement one of them with a KEK
and so on; I'm fine with that.



Adding a pre-defined system will not prevent future work on a
completely external option.


Yes and no.

Having two side-by-side cluster-encryption schemes in core, where the first
could be implemented on top of the second without much effort, would look
pretty weird. Moreover, the changes required in core to allow an API are the
very same as the ones in this patch. The other concern I have with doing the
cleanup work afterwards is that the API would end up somewhere in the middle
of the existing functions if it has not been thought out beforehand.


This has been considered and the functions which are proposed are
intentionally structured to allow for multiple implementations already,


Does it? This is unclear to me from looking at the code, and the limited
documentation does not point to that. I can see that the "kek provider" 
and the "random provider" can be changed, but the overall cryptographic 
design seems hardwired.



so it's unclear to me what the argument here is.


The argument is that even professional cryptographers get designs wrong, and I
do not see why you would do any better. So instead of hardwiring choices that
you think are THE good ones, ISTM that pg should design a reasonably
flexible API, and also provide an implementation, obviously.
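
To illustrate what I mean by a flexible API, here is a rough sketch of a
hook-replaceable provider. Every name below is invented for the example;
none of it is taken from the patch.

#include "postgres.h"

/*
 * Rough sketch only: a hook-replaceable key-management provider.  All
 * names are made up for illustration.
 */
typedef struct KeyManagementProvider
{
	/* hand back the key used to encrypt/decrypt one particular file */
	bool		(*get_file_key) (const char *file_id,
								 unsigned char *key, int *key_len);

	/* encrypt/decrypt one page; page_id lets the scheme vary the IV/tweak */
	void		(*encrypt_page) (uint64 page_id, const unsigned char *key,
								 const char *clear_page, char *encrypted_page);
	void		(*decrypt_page) (uint64 page_id, const unsigned char *key,
								 const char *encrypted_page, char *clear_page);

	/*
	 * Tell the provider that the key of one file is being replaced and the
	 * file rewritten; core would not need to know more than that about
	 * "key rotation".
	 */
	void		(*replace_file_key) (const char *file_id);
} KeyManagementProvider;

The KEK/DEK design in the patch would then be just one provider among
others, and a fully external one would be another.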


Further, we haven't even gotten to actual cluster encryption yet- this 
is all just the key management side of things,


With hardwired choices of 1 KEK and 2 DEKs, which are debatable, see
below.


which is absolutely a necessary and important part but arguing that it 
doesn't address the cluster encryption approach in core simply doesn't 
make any sense as that's not a part of this patch.


Let's have that discussion when we actually get to the point of talking
about cluster encryption.


I know archive_command is completely external, but that is much simpler
than what would need to be done for key management, key rotation, and key
verification.


I'm not sure about the design of the key rotation stuff as I recall it from the
patches I looked at, but the API design looks like the more pressing issue.
First, the mere existence of a "master key" is a debatable cryptographic choice,
because it creates issues if one must be able to change it, hence the whole
"key rotation" stuff. ISTM that what core needs to know is that one particular
key is replaced by a new one and the underlying file is rewritten. It should not
need to know anything about master keys and key derivation at all. It does need
to know that the user wants to change file keys, though.


No, it's not debatable that a KEK is needed, I disagree entirely.


ISTM that there is a cryptographic design which suits your needs, and you 
seem to have decided that it is the only possible way to do it. It seems that 
you know. As a researcher, I know that I do not know :-) As a software 
engineer, I'm trying to suggest enabling more with the 

Re: Spurious "apparent wraparound" via SimpleLruTruncate() rounding

2021-01-02 Thread Noah Misch
On Sat, Jan 02, 2021 at 12:31:45PM +0500, Andrey Borodin wrote:
> Do I understand correctly that this is bugfix that needs to be back-patched?

The slru-truncate-modulo patch fixes a bug.  The slru-truncate-t-insurance
patch does not.  Neither _needs_ to be back-patched, though I'm proposing to
back-patch both.  I welcome opinions about that.

> Thus we should not refactor 4 identical *PagePrecedes(int page1, int page2) 
> into 1 generic function?

I agree with not refactoring that way, in this case.

> Since functions are not symmetric anymore, maybe we should have better names 
> for arguments than "page1" and "page2"? At least in dev branch.

That works for me.  What names would you suggest?

> Is it common practice to embed tests into assert checking like in 
> SlruPagePrecedesUnitTests()?

No; it's neither common practice nor a policy breach.




Re: faster ETL / bulk data load for heap tables

2021-01-02 Thread Luc Vlaming

On 02-01-2021 08:36, Amit Kapila wrote:

On Fri, Jan 1, 2021 at 7:37 PM Luc Vlaming  wrote:


Hi,

In an effort to speed up bulk data loading/transforming I noticed that
considerable time is spent in the relation extension lock.



We already do extend the relation in bulk when there is contention
on the relation extension lock, via RelationAddExtraBlocks. I wonder why that
is not able to address this kind of workload. On a quick look at
your patch, it seems you are always trying to extend the relation by
128 blocks for copy operation after acquiring the lock whereas the
current mechanism has some smarts where it decides based on the number
of waiters. Now, it is possible that we should extend by a larger
number of blocks as compared to what we are doing now but using some
ad-hoc number might lead to space wastage. Have you tried to fiddle
with the current scheme of bulk-extension to see if that addresses
some of the gains you are seeing? I see that you have made quite a few
other changes that might be helping here but still, it is better to
see how much bottleneck is for relation extension lock and if that can
be addressed with the current mechanism rather than changing the
things in a different way.



Hi,

Thanks for looking at the patch!

Yes, I tried that. I guess I should have also shared all the other things I 
tried before I ended up with these patches.
I've tried to improve RelationAddExtraBlocks to extend the mechanism to 
more aggressively allocate blocks and to not put them in the FSM immediately, 
as this also seemed like a point of contention; extending ReadBufferBI 
to allocate several pages in a loop; and a few more variants like this.
To be sure, I just tried again a few variants where I made e.g. 
extraBlocks=128, 256, etc., and disabled e.g. the FSM code. None of those 
yields a very big performance gain, or they actually make it slower, suggesting 
the code is already parameterized quite well for the current design.


The main problem is that the relation extension lock is taken to extend 
one block at a time, whilst doing (expensive) syscalls like pwrite(). 
Even though we then put these blocks immediately in the FSM and such, 
the bottleneck remains the extension of the file itself, no matter how 
many blocks we allocate in a loop in RelationAddExtraBlocks.


Therefore what I've set out to do is:
- hold the relation extension lock for as short a time as possible.
- reduce the time spent on syscalls as much as possible.

This resulted in a design which hands out blocks to a specific backend 
so that everything after the file extension can be done safely without 
locks. Given that the current API did not allow this to be specified 
properly, for now I added extra functions whose sole purpose is to extend 
the relation, so that they could be purpose-built for this. I do not mean 
to suggest, however, that this is the best API there could be. The current 
state of this patch is in that sense still very crude; I just wrote the 
code I needed to be able to do performance testing. So it is, as you 
rightfully pointed out, quite ad-hoc and not very generic, nor the right 
design, and it has quite some code duplication, etc.
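
To make the reservation idea concrete, here is a minimal sketch (not the
actual patch; my_zero_extend_file() is an invented stand-in for the
smgr-level file extension):

#include "postgres.h"

#include "storage/bufmgr.h"
#include "storage/lmgr.h"
#include "utils/rel.h"

/* invented stand-in for the smgr-level extension (writing zero pages) */
static void my_zero_extend_file(Relation rel, BlockNumber first, int nblocks);

/*
 * Sketch: reserve a contiguous range of new blocks for the calling backend.
 * The extension lock is held only while the file grows; filling the pages
 * and (eventually) publishing free space can then happen without the lock,
 * because no other backend knows about these block numbers yet.
 */
static BlockNumber
my_bulk_extend(Relation rel, int nblocks)
{
	BlockNumber first_block;

	LockRelationForExtension(rel, ExclusiveLock);
	first_block = RelationGetNumberOfBlocks(rel);
	my_zero_extend_file(rel, first_block, nblocks);
	UnlockRelationForExtension(rel, ExclusiveLock);

	/* blocks [first_block, first_block + nblocks) now belong to this backend */
	return first_block;
}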


At this point I was mostly looking for feedback on the design/approach. 
If this requires a (much) cleaner and more production-ready patch then I can 
arrange that. However, I wanted to first find out whether this approach makes 
sense at all before doing more testing to make it more generic.


Kind regards,
Luc




Re: faster ETL / bulk data load for heap tables

2021-01-02 Thread Luc Vlaming

On 01-01-2021 19:55, Zhihong Yu wrote:

Hi, Luc:
Happy New Year.

Looking at BufferAllocExtend() 
in v1-0002-WIP-buffer-alloc-specialized-for-relation-extensi.patch, it 
seems there is duplicate code shared with the existing BufferAlloc().


It would be good if some refactoring is done by extracting common code 
into a helper function.


Thanks



Hi,

Thanks! Happy new year to you too!

Thanks for your suggestion. I would wait a bit and first get some 
feedback on the design/approach of my patches before doing the 
refactoring. The current code is very much a WIP where I just copied 
functions to be able to make specialized variants of them aimed at bulk 
extension and then benchmark those. If the refactoring needs to be done 
before I can get feedback on the design / approach then let me know.


Kind regards,
Luc




Re: Proposed patch for key management

2021-01-02 Thread Alastair Turner
Hi Fabien

On Sat, 2 Jan 2021 at 09:50, Fabien COELHO  wrote:
>
...
> ISTM that pg at the core level should (only) directly provide:
>
> (1) a per-file encryption scheme, with loadable (hook-replaceable
> functions??) to manage pages, maybe:
>
>encrypt(page_id, *key, *clear_page, *encrypted_page);
>decrypt(page_id, *key, *encrypted_page, *clear_page);
>
> What encrypt/decrypt does is beyond pg core stuff. Ok, a reasonable
> implementation should be provided, obviously, possibly in core. There may
> be some block-size issues if not full pages are encrypted, so maybe the
> interface needs to be a little more subtle.
>

There are a lot of specifics of the encryption implementation which
need to be addressed in future patches. This patch focuses on making
keys available to the encryption processes at run-time, so ...

>
> (2) offer a key management scheme interface, to manage *per-file* keys,
> possibly loadable (hook replaceable?). If all files have the same key,
> which is stored in a directory and encoded with a KEK, this is just one
> (debatable) implementation choice. For that, ISTM that what is needed at
> this level is:
>
>get_key(file_id (relative name? oid? 8 or 16 bytes something?));
>

Per-cluster keys for permanent data and WAL allow a useful level of
protection, even if it could be improved upon. They are also going to be
quicker/simpler to implement, so any API should allow for them. If
there's an arbitrary number of DEKs, using a scope label for
accessing them sounds right, e.g. "WAL", "local_data",
"local_data/tablespaceoid" or "local_data/dboid/tableoid".

>
...
> (3) ISTM that the key management interface should be external, or at least
> it should be possible to make it external easily. I do not think that
> there is a significant performance issue because keys are needed once, and
> once loaded they are there. A simple way to do that is a separate process
> with a basic protocol on stdin/stdout to implement "get_key", which is
> basically already half implemented in the patch for retrieving the KEK.
>

If keys can have arbitrary scope, then the pg backend won't know what
to ask for. So the API becomes even simpler, with no specific request
on stdin and all the relevant keys on stdout. I generally like this
approach as well, and it will be the only option for some
integrations. On the other hand, there is an advantage to having the
key retrieval piece of key management in-process: the keys are not
being passed around in the clear.

There is also a further validation task - probably beyond the scope of
the key management patch and into the encryption patch[es] territory -
checking that the keys supplied are the same keys in use for the data
currently on disk. It feels to me like this should be done at startup
rather than as each file is accessed, though that could make startup
quite slow if there are a lot of keys with narrow scope.

Regards
Alastair




Re: [HACKERS] [PATCH] Generic type subscripting

2021-01-02 Thread Dmitry Dolgov
> On Thu, Dec 31, 2020 at 08:21:55PM +0100, Pavel Stehule wrote:
> čt 31. 12. 2020 v 15:27 odesílatel Dmitry Dolgov <9erthali...@gmail.com>
> napsal:
>
> the tests passed and filling gaps works well
>
> but creating empty objects doesn't work
>
> create table foo(a jsonb);
>
> insert into foo values('{}');
>
> postgres=# update foo set a['k'][1] = '20';
> UPDATE 1
> postgres=# select * from foo;
> ┌───┐
> │ a │
> ╞═══╡
> │ {"k": [null, 20]} │
> └───┘
> (1 row)
>
> it is ok
>
> postgres=# update foo set a['k3'][10] = '20';
> UPDATE 1
> postgres=# select * from foo;
> ┌───┐
> │ a │
> ╞═══╡
> │ {"k": [null, 20]} │
> └───┘
> (1 row)
>
> the second update was not successful

Right, it was working only if the source level is empty, thanks for
checking. I've found a bit more time and prepared a more decent version
which covers all the cases I could come up with, following the same
implementation logic. The first patch is the same though.
From c9143a620497dac5615c4de1d9349684e9af95b5 Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthali...@gmail.com>
Date: Fri, 18 Dec 2020 17:19:51 +0100
Subject: [PATCH v41 1/2] Subscripting for jsonb

Subscripting implementation for jsonb. It does not support slices, does
not have a limit for the number of subscripts, and for assignment expects
the replacement value to be of jsonb type. There is also one functional
difference in assignment via subscripting from jsonb_set: when the
original jsonb container is NULL, subscripting replaces it with an empty
jsonb and proceeds with the assignment.

For the sake of code reuse, some parts of the jsonb functionality were
rearranged to allow using the same functions for jsonb_set and the
subscripting assignment operation.

The original idea belongs to Oleg Bartunov.

Reviewed-by: Tom Lane, Arthur Zakirov, Pavel Stehule
---
 doc/src/sgml/json.sgml  |  48 
 src/backend/utils/adt/Makefile  |   1 +
 src/backend/utils/adt/jsonb_util.c  |  76 -
 src/backend/utils/adt/jsonbsubs.c   | 413 
 src/backend/utils/adt/jsonfuncs.c   | 180 ++--
 src/include/catalog/pg_proc.dat |   4 +
 src/include/catalog/pg_type.dat |   3 +-
 src/include/utils/jsonb.h   |   6 +-
 src/test/regress/expected/jsonb.out | 272 +-
 src/test/regress/sql/jsonb.sql  |  84 +-
 10 files changed, 982 insertions(+), 105 deletions(-)
 create mode 100644 src/backend/utils/adt/jsonbsubs.c

diff --git a/doc/src/sgml/json.sgml b/doc/src/sgml/json.sgml
index 5b9a5557a4..100d1a60f4 100644
--- a/doc/src/sgml/json.sgml
+++ b/doc/src/sgml/json.sgml
@@ -602,6 +602,54 @@ SELECT jdoc->'guid', jdoc->'name' FROM api WHERE jdoc @> '{"tags": ["qu
   
  
 
+ 
+  jsonb Subscripting
+  
+   jsonb data type supports array-style subscripting expressions
+   to extract or update particular elements. It's possible to use multiple
+   subscripting expressions to extract nested values. In this case, a chain of
+   subscripting expressions follows the same rules as the
+   path argument in jsonb_set function,
+   e.g. in case of arrays it is a 0-based operation or that negative integers
+   that appear in path count from the end of JSON arrays.
+   The result of subscripting expressions is always jsonb data type. An
+   example of subscripting syntax:
+
+-- Extract value by key
+SELECT ('{"a": 1}'::jsonb)['a'];
+
+-- Extract nested value by key path
+SELECT ('{"a": {"b": {"c": 1}}}'::jsonb)['a']['b']['c'];
+
+-- Extract element by index
+SELECT ('[1, "2", null]'::jsonb)[1];
+
+-- Update value by key, note the single quotes - the assigned value
+-- needs to be of jsonb type as well
+UPDATE table_name SET jsonb_field['key'] = '1';
+
+-- Select records using where clause with subscripting. Since the result of
+-- subscripting is jsonb and we basically want to compare two jsonb objects, we
+-- need to put the value in double quotes to be able to convert it to jsonb.
+SELECT * FROM table_name WHERE jsonb_field['key'] = '"value"';
+
+
+  Subscripting for jsonb does not support slice expressions,
+  even if it contains an array.
+
+  In case if source jsonb is NULL, assignment
+  via subscripting will proceed as if it was an empty JSON object:
+
+-- If jsonb_field here is NULL, the result is {"a": 1}
+UPDATE table_name SET jsonb_field['a'] = '1';
+
+-- If jsonb_field here is NULL, the result is [1]
+UPDATE table_name SET jsonb_field[0] = '1';
+
+
+  
+ 
+
  
   Transforms
 
diff --git a/src/backend/utils/adt/Makefile b/src/backend/utils/adt/Makefile
index 82732146d3..279ff15ade 100644
--- a/src/backend/utils/adt/Makefile
+++ b/src/backend/utils/adt/Makefile
@@ -50,6 +50,7 @@ OBJS = \
 	jsonb_op.o \
 	jsonb_util.o \
 	jsonfuncs.o \
+	jsonbsubs.o \
 	jsonpath.o \
 	jsonpath_exec.o \
 	jsonpath_gram.o \
diff --git a/src/backend/utils/adt/jsonb_util.c b/src/backend/utils/adt/jsonb_util.c
index 4eeffa1424..41a1c1f9bb 100644
--- a

RE: [Patch] Optimize dropping of relation buffers using dlist

2021-01-02 Thread k.jami...@fujitsu.com
On Wednesday, December 30, 2020 8:58 PM, Amit Kapila wrote:
> On Wed, Dec 30, 2020 at 11:28 AM Tang, Haiying
>  wrote:
> >
> > Hi Amit,
> >
> > In my last mail
> > (https://www.postgresql.org/message-id/66851e198f6b41eda59e6257182564b6%40G08CNEXMBPEKD05.g08.fujitsu.local),
> > I've sent you the performance test results (run only 1 time) on a single table.
> > Here are my retested results (averaged over 15 runs), which I think are more
> > accurate.
> >
> > In terms of 20G and 100G, the optimization on 100G is linear, but 20G is
> > nonlinear (this also includes test results on shared buffers of 50G/60G), so
> > it's a little difficult for me to decide the threshold from the two.
> > If we just consider 100G, I think NBuffers/32 is the optimal max relation size.
> > But I don't know how to judge for 20G. If you have any suggestion, kindly let
> > me know.
> >
> 
> Considering these results, NBuffers/64 seems a good threshold, as beyond
> that there is no big advantage. BTW, it is not clear why the advantage for a
> single table is not as big as for multiple tables with the Truncate command. Can
> you share your exact test steps for any one of the tests?
> Also, did you set autovacuum = off for these tests? If not, then the results
> might not be reliable, because autovacuum would have done that work before
> you ran the test via the Vacuum command.

Happy new year. The V38 LGTM.
Apologies for a bit of a delay in posting the test results, but since it's the
start of the commitfest, here they are, and the results were interesting.

I executed a VACUUM test using the same approach that Tsunakawa-san did in [1],
but this time measuring only the total time that DropRelFileNodeBuffers() took.
I used only a single relation, and tried various sizes using the threshold
values NBuffers/512..NBuffers/1, as advised by Amit.

Example of relation sizes for NBuffers/512:
100GB shared_buffers:  200 MB
20GB shared_buffers:   40 MB
1GB shared_buffers:    2 MB
128MB shared_buffers:  0.25 MB

The regression, which means the patch performs worse than master, only happens
at relation sizes NBuffers/2 and NBuffers/1, for all shared_buffers settings. The
fastest performance on a single relation was with relation size NBuffers/512.

[VACUUM Recovery Performance on Single Relation]
Legend: P_XXX (Patch, NBuffers/XXX relation size),
M_XXX (Master, NBuffers/XXX relation size)
Unit: seconds

| Rel Size | 100 GB s_b | 20 GB s_b  | 1 GB s_b   | 128 MB s_b |
|----------|------------|------------|------------|------------|
| P_512    |  0.012594  |  0.001989  |  0.000081  |  0.000012  |
| M_512    |  0.208757  |  0.046212  |  0.002013  |  0.000295  |
| P_256    |  0.026311  |  0.004416  |  0.000129  |  0.000021  |
| M_256    |  0.241017  |  0.047234  |  0.002363  |  0.000298  |
| P_128    |  0.044684  |  0.009784  |  0.000290  |  0.000042  |
| M_128    |  0.253588  |  0.047952  |  0.002454  |  0.000319  |
| P_64     |  0.065806  |  0.017444  |  0.000521  |  0.000075  |
| M_64     |  0.268311  |  0.050361  |  0.002730  |  0.000339  |
| P_32     |  0.121441  |  0.033431  |  0.001646  |  0.000112  |
| M_32     |  0.285254  |  0.061486  |  0.003640  |  0.000364  |
| P_16     |  0.255503  |  0.065492  |  0.001663  |  0.000144  |
| M_16     |  0.377013  |  0.081613  |  0.003731  |  0.000372  |
| P_8      |  0.560616  |  0.109509  |  0.005954  |  0.000465  |
| M_8      |  0.692596  |  0.112178  |  0.006667  |  0.000553  |
| P_4      |  1.109437  |  0.162924  |  0.011229  |  0.000861  |
| M_4      |  1.162125  |  0.178764  |  0.011635  |  0.000935  |
| P_2      |  2.202231  |  0.317832  |  0.020783  |  0.002646  |
| M_2      |  1.583959  |  0.306269  |  0.015705  |  0.002021  |
| P_1      |  3.080032  |  0.632747  |  0.032183  |  0.002660  |
| M_1      |  2.705485  |  0.543970  |  0.030658  |  0.001941  |

%reg = (Patched - Master) / Patched

| %reg_relsize | 100 GB s_b | 20 GB s_b  | 1 GB s_b   | 128 MB s_b |
|--------------|------------|------------|------------|------------|
| %reg_512     | -1557.587% | -2223.006% | -2385.185% | -2354.167% |
| %reg_256     | -816.041%  | -969.691%  | -1731.783% | -1319.048% |
| %reg_128     | -467.514%  | -390.123%  | -747.008%  | -658.333%  |
| %reg_64      | -307.727%  | -188.704%  | -423.992%  | -352.000%  |
| %reg_32      | -134.891%  | -83.920%   | -121.097%  | -225.970%  |
| %reg_16      | -47.557%   | -24.614%   | -124.279%  | -157.390%  |
| %reg_8       | -23.542%   | -2.437%    | -11.967%   | -19.010%   |
| %reg_4       | -4.749%    | -9.722%    | -3.608%    | -8.595%    |
| %reg_2       | 28.075%    | 3.638%     | 24.436%    | 23.615%    |
| %reg_1       | 12.160%    | 14.030%    | 4.739%     | 27.010%    |

Since our goal is to get the approximate threshold where the cost of
finding the to-be-invalidated buffers becomes higher in the optimized path
than in the traditional path:
A. Traditional Path
 1. For each buffer in shared_buffers, compare the relfilenode.
 2. LockBufHdr()
 3. Compare the block number, InvalidateBuffers() if it'

Re: Move --data-checksums to common options in initdb --help

2021-01-02 Thread Stephen Frost
Greetings,

* Michael Paquier (mich...@paquier.xyz) wrote:
> On Fri, Jan 01, 2021 at 08:34:34PM +0100, Michael Banck wrote:
> > I think enough people use data checksums these days that it warrants to
> > be moved into the "normal part", like in the attached.
> 
> +1.  Let's see first what others think about this change.

I agree with this, but I'd also like to propose, again, as has been
discussed a few times, making it the default too.

Thanks,

Stephen




Why latestRemovedXid|cuteoff_xid are always sent?

2021-01-02 Thread Michail Nikolaev
Hello, hackers.

Working on some stuff, I realized I do not understand why
latestRemovedXid|cutoff_xid (in different types of WAL records) are
sent every time they appear on the primary side.

latestRemovedXid|cutoff_xid is used to call
ResolveRecoveryConflictWithSnapshot and cancel conflicting backends on
the standby. In some of the cases, snapshot conflict resolution is the only
work REDO does (heap_xlog_cleanup_info or btree_xlog_reuse_page, for example).

Could we try to somehow optimistically advance the latest sent
latestRemovedXid value in shared memory on the primary and skip
sending it if a newer xid has already been sent? That way we could
reduce the number of ResolveRecoveryConflictWithSnapshot calls on
the standby and even skip some WAL records.

At least we could do the same optimization on the standby side
(skipping ResolveRecoveryConflictWithSnapshot if it was already called
with a newer xid).

Is this a sane idea, or have I missed something huge?

Thanks,
Michail.




Re: archive status ".ready" files may be created too early

2021-01-02 Thread Andrey Borodin
Hi!

I was looking for something to review in the CF. This seems like a thread of
some interest to me.

We recently had a somewhat related incident. Do I understand correctly that this
incident is related to the bug discussed in this thread?

The primary instance was killed by OOM:
[ 2020-11-12 15:27:03.732 MSK ,,,739,0 ]:LOG:  server process (PID 40189) 
was terminated by signal 9: Killed
After recovery it archived some WAL segments:
[ 2020-11-12 15:27:31.477 MSK ,,,739,0 ]:LOG:  database system is ready to 
accept connections
INFO: 2020/11/12 15:27:32.059541 FILE PATH: 000E0001C02F00AF.br
INFO: 2020/11/12 15:27:32.114319 FILE PATH: 000E0001C02F00B3.br

Then PITR failed on another host:
[ 2020-11-12 16:26:33.024 MSK ,,,51414,0 ]:LOG:  restored log file 
"000E0001C02F00B3" from archive
[ 2020-11-12 16:26:33.042 MSK ,,,51414,0 ]:LOG:  invalid record length at 
1C02F/B3FFF778: wanted 24, got 0
[ 2020-11-12 16:26:33.042 MSK ,,,51414,0 ]:LOG:  invalid record length at 
1C02F/B3FFF778: wanted 24, got 0

The archived segment has some zeroes at the end:
rmgr: XLOGlen (rec/tot): 51/  1634, tx:  0, lsn: 
1C02F/B3FFF058, prev 1C02F/B3FFEFE8, desc: FPI_FOR_HINT , blkref #0: rel 
1663/14030/16384 blk 140 FPW
rmgr: Heaplen (rec/tot):129/   129, tx: 3890578935, lsn: 
1C02F/B3FFF6C0, prev 1C02F/B3FFF058, desc: HOT_UPDATE off 34 xmax 3890578935 ; 
new off 35 xmax 0, blkref #0: rel 1663/14030/16384 blk 140
rmgr: Transaction len (rec/tot): 46/46, tx: 3890578935, lsn: 
1C02F/B3FFF748, prev 1C02F/B3FFF6C0, desc: COMMIT 2020-11-12 15:27:31.507363 MSK
pg_waldump: FATAL:  error in WAL record at 1C02F/**B3FFF748**: invalid record 
length at 1C02F/**B3FFF778**: wanted 24, got 0

Meanwhile, the next segment points to the previous record at **B3FFF748**:
postgres@man-odszl7u4361o8m3z:/tmp$ pg_waldump 000E0001C02F00B4| head
rmgr: Heaplen (rec/tot):129/   129, tx: 3890578936, lsn: 
1C02F/B4000A68, prev 1C02F/**B3FFF778**, desc: HOT_UPDATE off 35 xmax 
3890578936 ; new off 36 xmax 0, blkref #0: rel 1663/14030/16384 blk 140
rmgr: Transaction len (rec/tot): 46/46, tx: 3890578936, lsn: 
1C02F/B4000AF0, prev 1C02F/B4000A68, desc: COMMIT 2020-11-12 15:27:32.509443 MSK

Best regards, Andrey Borodin.



Re: Reloptions for table access methods

2021-01-02 Thread Jeff Davis
On Wed, 2020-12-30 at 21:23 +, Simon Riggs wrote:
> But you cannot seriously argue that a one line patch that changes
> user
> visible behavior can be acceptable in Postgres core without tests or
> docs or code comments.

Hi Simon,

Often, good documented APIs come about after a few extensions gain
experience hacking things together using undocumented assumptions and
internal APIs. But in this case, extension authors have no opportunity
to hack in reloptions for a TableAM, because of this needless extra
check that my patch would eliminate.

The patch does not have any user-visible impact. To have any effect, an
extension would need to use these internal APIs in a specific way that
is not documented externally. I see my tiny patch as a tiny improvement
to the backend code because it eliminates a useless extra check, and
therefore doesn't need much justification. If others see it as a burden
on the engine code, that's a different story.

If we insist that this must be a fully-supported feature or nothing at
all, then I'll withdraw the patch. But I doubt that will result in a
proper feature for v14, it just means that when we do get around to
supporting extensible reloptions, there will be no hacks in the wild to
learn from.

Regards,
Jeff Davis






Re: pgbench: option delaying queries till connections establishment?

2021-01-02 Thread Fabien COELHO




It looks like macOS doesn't have pthread barriers (via cfbot 2021, now
with more operating systems):


Indeed:-(

I'll look into that.


Just for fun, the attached 0002 patch is a quick prototype of a
replacement for that stuff that seems to work OK on a Mac here.  (I'm
not sure if the Windows part makes sense or works.)


Thanks! That will definitely help because I do not have a Mac. I'll do 
some cleanup.
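
For reference, the usual replacement is the standard mutex plus condition
variable construction, roughly as below; this is only a sketch of that
well-known technique, not Thomas's 0002 patch:

#include <pthread.h>

/*
 * A minimal barrier built from a mutex and a condition variable, for
 * platforms (like macOS) that lack pthread_barrier_t.
 */
typedef struct
{
	pthread_mutex_t mutex;
	pthread_cond_t	cond;
	int				count;		/* threads expected */
	int				arrived;	/* threads currently waiting */
	int				generation; /* avoids late wakeups across reuse */
} my_barrier_t;

static void
my_barrier_init(my_barrier_t *b, int count)
{
	pthread_mutex_init(&b->mutex, NULL);
	pthread_cond_init(&b->cond, NULL);
	b->count = count;
	b->arrived = 0;
	b->generation = 0;
}

static void
my_barrier_wait(my_barrier_t *b)
{
	pthread_mutex_lock(&b->mutex);
	if (++b->arrived == b->count)
	{
		/* last thread in: release everybody and start a new generation */
		b->arrived = 0;
		b->generation++;
		pthread_cond_broadcast(&b->cond);
	}
	else
	{
		int gen = b->generation;

		while (gen == b->generation)
			pthread_cond_wait(&b->cond, &b->mutex);
	}
	pthread_mutex_unlock(&b->mutex);
}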


--
Fabien.




Re: WIP: BRIN multi-range indexes

2021-01-02 Thread Tomas Vondra
Hi,

On 1/2/21 7:42 AM, Thomas Munro wrote:
> On Sun, Dec 20, 2020 at 1:16 PM Tomas Vondra
>  wrote:
>> Attached is an updated version of the patch series, rebased on current
>> master, and results for benchmark comparing the various bloom variants.
> 
> Perhaps src/include/utils/inet.h needs to include ,
> because FreeBSD says:
> 
> brin_minmax_multi.c:1693:24: error: use of undeclared identifier 'AF_INET'
> if (ip_family(ipa) == PGSQL_AF_INET)
> ^
> ../../../../src/include/utils/inet.h:39:24: note: expanded from macro
> 'PGSQL_AF_INET'
> #define PGSQL_AF_INET (AF_INET + 0)
> ^

Not sure. The other files using PGSQL_AF_INET just include sys/socket.h
directly, so maybe this should just do the same thing ...
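
I.e. something like this at the top of brin_minmax_multi.c, assuming that is
the preferred fix:

#include "postgres.h"

#include <sys/socket.h>			/* for AF_INET, as the other PGSQL_AF_INET users do */

#include "utils/inet.h"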


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Why latestRemovedXid|cuteoff_xid are always sent?

2021-01-02 Thread Peter Geoghegan
On Sat, Jan 2, 2021 at 8:00 AM Michail Nikolaev
 wrote:
> Working on some stuff, I realized I do not understand why
> latestRemovedXid|cutoff_xid (in different types of WAL records) are
> sent every time they appear on the primary side.
>
> latestRemovedXid|cutoff_xid is used to call
> ResolveRecoveryConflictWithSnapshot and cancel conflicting backends on
> the standby. In some of the cases, snapshot conflict resolution is the only
> work REDO does (heap_xlog_cleanup_info or btree_xlog_reuse_page, for example).

But you can say the same thing about fields from certain WAL record
types, too. It's not that uncommon for code to make a conceptually
optional piece of information into a normal WAL record struct field,
even though that approach has unnecessary space overhead in the cases
that don't need the information. Often this makes hardly any
difference due to factors like alignment and the simple fact that we
don't expect very many WAL records (with or without the optional
information) in practice.

Of course, it's possible that the question of whether or not it's
worth it has been misjudged for any given case. And maybe these
particular WAL records are one such case where somebody got it wrong,
affecting a real workload (I am ignoring the complexity of making it
work for latestRemovedXid in particular for now). But I tend to doubt
that the space saving would be noticeable, from what I've seen with
pg_waldump.

> Could we try to somehow optimistically advance the latest sent
> latestRemovedXid value in shared memory on the primary and skip
> sending it if a newer xid has already been sent? That way we could
> reduce the number of ResolveRecoveryConflictWithSnapshot calls on
> the standby and even skip some WAL records.
>
> At least we could do the same optimization on the standby side
> (skipping ResolveRecoveryConflictWithSnapshot if it was already called
> with a newer xid).

Maybe it makes sense to add a fast path to
ResolveRecoveryConflictWithSnapshot() so that it falls out early
without scanning the proc array (in cases where it will still do so
today, since of course ResolveRecoveryConflictWithSnapshot() has the
obvious InvalidTransactionId fast path already).
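
Something along these lines, purely as an illustration (the wrapper, the
static variable, and the skip rule are invented here, and note that it
ignores the per-database RelFileNode filtering the real function does):

#include "postgres.h"

#include "access/transam.h"
#include "storage/standby.h"

/*
 * Illustration only: remember the newest xid we already resolved conflicts
 * for, and skip the proc array scan for anything that is not newer.
 */
static TransactionId last_resolved_xid = InvalidTransactionId;

static void
my_resolve_snapshot_conflict(TransactionId latestRemovedXid, RelFileNode node)
{
	if (!TransactionIdIsValid(latestRemovedXid))
		return;					/* the existing fast path */

	if (TransactionIdIsValid(last_resolved_xid) &&
		!TransactionIdFollows(latestRemovedXid, last_resolved_xid))
		return;					/* nothing newer than what was already handled */

	last_resolved_xid = latestRemovedXid;
	ResolveRecoveryConflictWithSnapshot(latestRemovedXid, node);
}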

> Is this a sane idea, or have I missed something huge?

It seems like two almost distinct ideas to me. Though the important
thing in both cases is the savings in real world conditions.

-- 
Peter Geoghegan




Re: doc review for v14

2021-01-02 Thread Michael Paquier
On Tue, Dec 29, 2020 at 06:22:43PM +0900, Michael Paquier wrote:
> Yes, I have done an extra effort on those fixes where needed.  On top
> of that, I have included catalogs.sgml, pgstatstatements.sgml,
> explain.sgml, pg_verifybackup.sgml and wal.sgml in 13.

Justin, I got to look at the libxml2 part, and finished by rewording
the comment block as follows:
+* XmlTable returns a table-set of composite values.  This error context
+* is used for providing more details, and needs to be reset between two
+* internal calls of libxml2 as different error contexts might have been
+* created or used.

What do you think?
--
Michael




Re: doc review for v14

2021-01-02 Thread Justin Pryzby
On Sun, Jan 03, 2021 at 03:10:54PM +0900, Michael Paquier wrote:
> On Tue, Dec 29, 2020 at 06:22:43PM +0900, Michael Paquier wrote:
> > Yes, I have done an extra effort on those fixes where needed.  On top
> > of that, I have included catalogs.sgml, pgstatstatements.sgml,
> > explain.sgml, pg_verifybackup.sgml and wal.sgml in 13.
> 
> Justin, I got to look at the libxml2 part, and finished by rewording
> the comment block as follows:
> +* XmlTable returns a table-set of composite values.  This error context
> +* is used for providing more details, and needs to be reset between two
> +* internal calls of libxml2 as different error contexts might have been
> +* created or used.

I don't like "this error context", since "this" seems to be referring to the
"tableset of composite values" as an err context.

I guess you mean: "needs to be reset between each internal call to libxml2.."

So I'd suggest:

> +* XmlTable returns a table-set of composite values.  The error context
> +* is used for providing additional detail.  It needs to be reset between
> +* each call to libxml2, since different error contexts might have been
> +* created or used since it was last set.


But actually, maybe we should just use the comment that exists everywhere else
for that.

/* Propagate context related error context to libxml2 */
xmlSetStructuredErrorFunc((void *) xtCxt->xmlerrcxt, xml_errorHandler);

Maybe we should elaborate and say:
/*
 * Propagate context related error context to libxml2 (needs to be
 * reset before each call, in case other error contexts have been
 * assigned since it was first set).
 */
xmlSetStructuredErrorFunc((void *) xtCxt->xmlerrcxt, xml_errorHandler);

-- 
Justin