Re: Asynchronous Append on postgres_fdw nodes.
On Sun, Dec 20, 2020 at 5:15 PM Etsuro Fujita wrote:
> On Thu, Nov 26, 2020 at 10:28 AM movead...@highgo.ca wrote:
> > Issue one:
> > Get an Assert error at 'Assert(bms_is_member(i, node->as_needrequest));' in
> > the ExecAppendAsyncRequest() function when I use more than two foreign tables
> > on different foreign servers.
> >
> > I researched the code and did such a change, and then the Assert problem disappears.
>
> Could you show a test case causing the assertion failure?

I happened to reproduce the same failure in my environment.  I think your change would be correct, but I changed the patch so that it doesn't need as_lastasyncplan anymore [1].  The new version of the patch works well for my case.  So, could you test your case with it?

Best regards,
Etsuro Fujita

[1] https://www.postgresql.org/message-id/CAPmGK17L0j6otssa53ZvjnCsjguJHZXaqPL2HU_LDoZ4ATZjEw%40mail.gmail.com
Re: Proposed patch for key management
Hello Stephen,

I'm unsure about what is the "common use case". ISTM that having the postgres
process holding the master key looks like a "no" for me in any use case with
actual security concern: as the database must be remotely accessible, a
reasonable security assumption is that a pg account could be compromised, and
the "master key" (if any, that is just one particular cryptographic design)
should not be accessible in that case. The first barrier would be the pg admin
account. With external, the options are that the key is held by another id, so
the host must be compromised as well, and could even be managed remotely on
another host (I agree that this latter option would be fragile because of the
implied network connection, but it would be a tradeoff depending on the
security concerns, and it pretty much corresponds to the kerberos model).

> No, this isn't anything like the Kerberos model and there's no case where
> the PG account won't have access to the DEK (data encryption key) during
> normal operation (except with the possibility of off-loading to a hardware
> crypto device which, while interesting, is definitely something that's of
> very limited utility and which could be added on as a capability later
> anyway). This is also well understood by those who are interested in this
> capability and it's perfectly acceptable that PG will have, in memory, the
> DEK.

I'm sorry that I'm not very clear. My ranting is about having a KEK "master
key" in pg memory. I think I'm fine at this point with having DEKs available
on the host to encode/decode files. What I meant about kerberos is that the
setup I was describing was making the database dependent on a remote host,
which is somehow what kerberos does.

> The keys which are stored on disk with the proposed patch are encrypted
> using a KEK which is external to PG- that's all part of the existing patch
> and design.

Yep. My point is that the patch hardwires a cryptographic design with many
assumptions, whereas I think it should design an API compatible with a larger
class of designs, and possibly implement one with a KEK and so on; I'm fine
with that.

> Adding a pre-defined system will not prevent future work on a completely
> external option.

Yes and no. Having two side-by-side cluster-encryption schemes in core, the
first of which could be implemented on top of the second without much effort,
would look pretty weird. Moreover, the changes required in core to allow an
API are the very same as the ones in this patch. The other concern I have
with doing the cleaning work afterwards is that the API would end up somehow
in the middle of the existing functions if it has not been thought of before.

> This has been considered and the functions which are proposed are
> intentionally structured to allow for multiple implementations already,

Does it? This is unclear to me from looking at the code, and the limited
documentation does not point to that. I can see that the "kek provider" and
the "random provider" can be changed, but the overall cryptographic design
seems hardwired.

> so it's unclear to me what the argument here is.

The argument is that even professional cryptographers produce wrong designs,
and I do not see why you would do differently. So instead of hardwiring
choices that you think are THE good ones, ISTM that pg should design a
reasonably flexible API, and also provide an implementation, obviously.

> Further, we haven't even gotten to actual cluster encryption yet- this is
> all just the key management side of things,

With hardwired choices about 1 KEK and 2 DEKs that are debatable, see below.

> which is absolutely a necessary and important part but arguing that it
> doesn't address the cluster encryption approach in core simply doesn't make
> any sense as that's not a part of this patch. Let's have that discussion
> when we actually get to the point of talking about cluster encryption. I
> know archive_command is completely external, but that is much simpler than
> what would need to be done for key management, key rotation, and key
> verification.

I'm not sure of the design of the key rotation stuff as I recall it from the
patches I looked at, but the API design looks like the more pressing issue.
First, the mere existence of a "master key" is a cryptographic choice which is
debatable, because it creates issues if one must be able to change it, hence
the whole "key rotation" stuff. ISTM that what core needs to know is that one
particular key is changed by a new one and the underlying file is rewritten.
It should not need to know anything about master keys and key derivation at
all. It should need to know that the user wants to change file keys, though.

> No, it's not debatable that a KEK is needed, I disagree entirely.

ISTM that there is a cryptographic design which suits your needs and you seem
to decide that it is the only possible way to do it. It seems that you know.
As a researcher, I know that I do not know:-) As a software engineer, I'm
trying to suggest enabling more with the
Re: Spurious "apparent wraparound" via SimpleLruTruncate() rounding
On Sat, Jan 02, 2021 at 12:31:45PM +0500, Andrey Borodin wrote:
> Do I understand correctly that this is a bugfix that needs to be back-patched?

The slru-truncate-modulo patch fixes a bug.  The slru-truncate-t-insurance patch does not.  Neither _needs_ to be back-patched, though I'm proposing to back-patch both.  I welcome opinions about that.

> Thus we should not refactor 4 identical *PagePrecedes(int page1, int page2)
> into 1 generic function?

I agree with not refactoring that way, in this case.

> Since functions are not symmetric anymore, maybe we should have better names
> for arguments than "page1" and "page2"? At least in the dev branch.

That works for me.  What names would you suggest?

> Is it common practice to embed tests into assert checking like in
> SlruPagePrecedesUnitTests()?

No; it's neither common practice nor a policy breach.
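As background for readers following the naming discussion, here is a small self-contained sketch (illustrative only, not the PostgreSQL code in question) of a modulo-style "precedes" comparison over a circular ID space, together with the kind of property checks a unit-tests-inside-assert helper might perform:

#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Illustrative only: a "precedes" test over a circular 32-bit ID space,
 * in the spirit of the SLRU *PagePrecedes callbacks.  This is NOT the
 * PostgreSQL implementation; it just shows why such comparisons are
 * modulo-based and therefore not symmetric in the ordinary sense.
 */
static bool
CircularIdPrecedes(uint32_t a, uint32_t b)
{
	/* a precedes b if the signed distance from a to b is positive */
	return (int32_t) (b - a) > 0;
}

/* A tiny self-check of the kind a unit-tests-in-assert helper might run. */
static void
CircularIdPrecedesUnitTests(void)
{
	/* strictness: an ID never precedes itself */
	assert(!CircularIdPrecedes(10, 10));
	/* asymmetry: at less than half the ID space apart, only one direction holds */
	assert(CircularIdPrecedes(10, 20) && !CircularIdPrecedes(20, 10));
	/* wraparound: an ID just past UINT32_MAX is "ahead of" one just before it */
	assert(CircularIdPrecedes(UINT32_MAX - 5, 3));
}

int
main(void)
{
	CircularIdPrecedesUnitTests();
	return 0;
}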
Re: faster ETL / bulk data load for heap tables
On 02-01-2021 08:36, Amit Kapila wrote:
> On Fri, Jan 1, 2021 at 7:37 PM Luc Vlaming wrote:
>> Hi,
>>
>> In an effort to speed up bulk data loading/transforming I noticed that
>> considerable time is spent in the relation extension lock.
>
> We already do extend the relation in bulk when there is contention on the
> relation extension lock via RelationAddExtraBlocks. I wonder why that is
> not able to address this kind of workload. On a quick look at your patch,
> it seems you are always trying to extend the relation by 128 blocks for the
> copy operation after acquiring the lock, whereas the current mechanism has
> some smarts where it decides based on the number of waiters. Now, it is
> possible that we should extend by a larger number of blocks as compared to
> what we are doing now, but using some ad-hoc number might lead to space
> wastage. Have you tried to fiddle with the current scheme of bulk extension
> to see if that addresses some of the gains you are seeing? I see that you
> have made quite a few other changes that might be helping here, but still,
> it is better to see how much of the bottleneck is the relation extension
> lock and whether that can be addressed with the current mechanism rather
> than changing things in a different way.

Hi,

Thanks for looking at the patch!

Yes I tried that. I guess I should have also shared all the other things I
tried before I ended up with these patches. I've tried to improve
RelationAddExtraBlocks to extend the mechanism to more aggressively allocate
blocks, to not put them in the FSM immediately as this also seemed like a
point of contention, to extend ReadBufferBI to allocate several pages in a
loop, and a few more variants like this. To be sure I just tried again a few
variants where I made e.g. extraBlocks=128, 256, etc, and disabled e.g. the
FSM code. None of those give very big performance gains or actually make it
slower, suggesting the code is parameterized quite well already for the
current design.

The main problem is that the relation extension lock is taken to extend one
block at a time, whilst doing (expensive) syscalls like pwrite(). Even though
we then put these blocks immediately in the FSM and such, the bottleneck
stays the extension of the file itself, no matter how many blocks we allocate
in a loop in RelationAddExtraBlocks.

Therefore what I've set out to do is:
- make the relation extension lock be held as short as possible.
- reduce the time spent on syscalls as much as possible.

This resulted in a design which hands out blocks to a specific backend so
that everything after the file extension can be done safely without locks (a
rough sketch of this idea follows below). Given that the current API did not
allow this to be specified properly, for now I added extra functions whose
sole purpose is to extend the relation, so that they could be purpose-built
for this. I did not want to suggest this is, however, the best API there
could be. The current state of this patch is in that sense still very crude.
I just wrote the code I needed to be able to do performance testing. So it
is, as you rightfully pointed out, quite ad-hoc and not very generic, nor the
right design, and it has quite some code duplication, etc. I was at this
point mostly looking for feedback on the design/approach. If this requires a
(much) cleaner and more productified patch then I can arrange that. However I
thought to first find out if this approach makes sense at all before doing
more testing to make this more generic.

Kind regards,
Luc
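For illustration, a rough, hypothetical C sketch of the "hand out a private block range" idea described above, where only the file-size change happens under the relation extension lock; the helper name ExtendRelationInBulk and the batch size are invented for illustration and this is not the actual patch:

/*
 * Hypothetical sketch, assuming PostgreSQL backend context; not the patch.
 */
#include "postgres.h"
#include "storage/lmgr.h"
#include "storage/smgr.h"
#include "utils/rel.h"

#define BULK_EXTEND_BLOCKS 128		/* illustrative batch size */

/*
 * Extend 'rel' by nblocks and return the first block of the new range.
 * The extension lock is held only around the file-size change; afterwards
 * the caller owns [first, first + nblocks) and can initialize and fill
 * those pages without further locking.
 */
static BlockNumber
ExtendRelationInBulk(Relation rel, int nblocks)
{
	PGAlignedBlock zerobuf;
	SMgrRelation smgr;
	BlockNumber first;
	int			i;

	memset(zerobuf.data, 0, BLCKSZ);

	RelationOpenSmgr(rel);
	smgr = rel->rd_smgr;

	LockRelationForExtension(rel, ExclusiveLock);

	first = smgrnblocks(smgr, MAIN_FORKNUM);

	/* One pass of writes; ideally a single larger write or fallocate call. */
	for (i = 0; i < nblocks; i++)
		smgrextend(smgr, MAIN_FORKNUM, first + i, zerobuf.data, false);

	UnlockRelationForExtension(rel, ExclusiveLock);

	/* Caller now fills blocks [first, first + nblocks) lock-free. */
	return first;
}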
Re: faster ETL / bulk data load for heap tables
On 01-01-2021 19:55, Zhihong Yu wrote:
> Hi, Luc:
> Happy New Year.
>
> Looking at BufferAllocExtend() in
> v1-0002-WIP-buffer-alloc-specialized-for-relation-extensi.patch,
> it seems there is duplicate code with the existing BufferAlloc().
>
> It would be good if some refactoring is done by extracting common code
> into a helper function.
>
> Thanks

Hi,

Thanks! Happy new year to you too!

Thanks for your suggestion. I would wait a bit and first get some feedback
on the design/approach of my patches before doing the refactoring. The
current code is very much a WIP where I just copied functions to be able to
make specialized variants of them aimed at bulk extension and then benchmark
those. If the refactoring needs to be done before I can get feedback on the
design/approach then let me know.

Kind regards,
Luc
Re: Proposed patch for key management
Hi Fabien

On Sat, 2 Jan 2021 at 09:50, Fabien COELHO wrote:
> ...
> ISTM that pg at the core level should (only) directly provide:
>
> (1) a per-file encryption scheme, with loadable (hook-replaceable
> functions??) to manage pages, maybe:
>
>    encrypt(page_id, *key, *clear_page, *encrypted_page);
>    decrypt(page_id, *key, *encrypted_page, *clear_page);
>
> What encrypt/decrypt does is beyond pg core stuff. Ok, a reasonable
> implementation should be provided, obviously, possibly in core. There may
> be some block-size issues if not full pages are encrypted, so maybe the
> interface needs to be a little more subtle.

There are a lot of specifics of the encryption implementation which need to
be addressed in future patches. This patch focuses on making keys available
to the encryption processes at run-time, so ...

> (2) offer a key management scheme interface, to manage *per-file* keys,
> possibly loadable (hook replaceable?). If all files have the same key,
> which is stored in a directory and encoded with a KEK, this is just one
> (debatable) implementation choice. For that, ISTM that what is needed at
> this level is:
>
>    get_key(file_id (relative name? oid? 8 or 16 bytes something?));

Per-cluster keys for permanent data and WAL allow a useful level of
protection, even if it could be improved upon. It's also going to be
quicker/simpler to implement, so any API should allow for it. If there's an
arbitrary number of DEKs, using a scope label for accessing them sounds
right, so "WAL", "local_data", "local_data/tablespaceoid" or
"local_data/dboid/tableoid".

> ...
> (3) ISTM that the key management interface should be external, or at least
> it should be possible to make it external easily. I do not think that
> there is a significant performance issue because keys are needed once, and
> once loaded they are there. A simple way to do that is a separate process
> with a basic protocol on stdin/stdout to implement "get_key", which is
> basically already half implemented in the patch for retrieving the KEK.

If keys can have arbitrary scope, then the pg backend won't know what to ask
for. So the API becomes even simpler with no specific request on stdin and
all the relevant keys on stdout. I generally like this approach as well, and
it will be the only option for some integrations. On the other hand, there
is an advantage to having the key retrieval piece of key management
in-process - the keys are not being passed around in plain.

There is also a further validation task - probably beyond the scope of the
key management patch and into the encryption patch[es] territory - checking
that the keys supplied are the same keys in use for the data currently on
disk. It feels to me like this should be done at startup, rather than as
each file is accessed, which could make startup quite slow if there are a
lot of keys with narrow scope.

Regards

Alastair
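To make the proposed interface concrete, here is a hedged C sketch of what hook points along the lines of (1) and (2) above might look like; all names, types, and signatures here are invented for illustration and are not from the key management patch:

/*
 * Hypothetical sketch of a pluggable encryption/key-lookup interface as
 * discussed above; none of these names exist in the proposed patch.
 */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ENCRYPTION_KEY_MAX_LEN 32	/* e.g. AES-256 */

typedef struct EncryptionKey
{
	uint8_t		bytes[ENCRYPTION_KEY_MAX_LEN];
	size_t		length;
} EncryptionKey;

/* Per-page encryption hooks; a default implementation could live in core. */
typedef bool (*encrypt_page_hook) (uint64_t page_id,
								   const EncryptionKey *key,
								   const uint8_t *clear_page,
								   uint8_t *encrypted_page,
								   size_t page_size);
typedef bool (*decrypt_page_hook) (uint64_t page_id,
								   const EncryptionKey *key,
								   const uint8_t *encrypted_page,
								   uint8_t *clear_page,
								   size_t page_size);

/*
 * Key lookup by scope label, e.g. "WAL", "local_data",
 * "local_data/<dboid>/<tableoid>".  Returns false if no key is available
 * for that scope.
 */
typedef bool (*get_key_hook) (const char *scope, EncryptionKey *key_out);

The point of the sketch is only that the hardwired KEK/DEK layout becomes one possible implementation behind such hooks, rather than the only shape core knows about.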
Re: [HACKERS] [PATCH] Generic type subscripting
> On Thu, Dec 31, 2020 at 08:21:55PM +0100, Pavel Stehule wrote: > čt 31. 12. 2020 v 15:27 odesílatel Dmitry Dolgov <9erthali...@gmail.com> > napsal: > > the tests passed and filling gaps works well > > but creating empty objects doesn't work > > create table foo(a jsonb); > > insert into foo values('{}'); > > postgres=# update foo set a['k'][1] = '20'; > UPDATE 1 > postgres=# select * from foo; > ┌───┐ > │ a │ > ╞═══╡ > │ {"k": [null, 20]} │ > └───┘ > (1 row) > > it is ok > > postgres=# update foo set a['k3'][10] = '20'; > UPDATE 1 > postgres=# select * from foo; > ┌───┐ > │ a │ > ╞═══╡ > │ {"k": [null, 20]} │ > └───┘ > (1 row) > > the second update was not successful Right, it was working only if the source level is empty, thanks for checking. I've found a bit more time and prepared more decent version which covers all the cases I could come up with following the same implementation logic. The first patch is the same though. >From c9143a620497dac5615c4de1d9349684e9af95b5 Mon Sep 17 00:00:00 2001 From: Dmitrii Dolgov <9erthali...@gmail.com> Date: Fri, 18 Dec 2020 17:19:51 +0100 Subject: [PATCH v41 1/2] Subscripting for jsonb Subscripting implementation for jsonb. It does not support slices, does not have a limit for number of subscripts and for assignment expects a replace value to be of jsonb type. There is also one functional difference in assignment via subscripting from jsonb_set, when an original jsonb container is NULL, subscripting replaces it with an empty jsonb and proceed with assignment. For the sake of code reuse, some parts of jsonb functionality were rearranged to allow use the same functions for jsonb_set and assign subscripting operation. The original idea belongs to Oleg Bartunov. Reviewed-by: Tom Lane, Arthur Zakirov, Pavel Stehule --- doc/src/sgml/json.sgml | 48 src/backend/utils/adt/Makefile | 1 + src/backend/utils/adt/jsonb_util.c | 76 - src/backend/utils/adt/jsonbsubs.c | 413 src/backend/utils/adt/jsonfuncs.c | 180 ++-- src/include/catalog/pg_proc.dat | 4 + src/include/catalog/pg_type.dat | 3 +- src/include/utils/jsonb.h | 6 +- src/test/regress/expected/jsonb.out | 272 +- src/test/regress/sql/jsonb.sql | 84 +- 10 files changed, 982 insertions(+), 105 deletions(-) create mode 100644 src/backend/utils/adt/jsonbsubs.c diff --git a/doc/src/sgml/json.sgml b/doc/src/sgml/json.sgml index 5b9a5557a4..100d1a60f4 100644 --- a/doc/src/sgml/json.sgml +++ b/doc/src/sgml/json.sgml @@ -602,6 +602,54 @@ SELECT jdoc->'guid', jdoc->'name' FROM api WHERE jdoc @> '{"tags": ["qu + + jsonb Subscripting + + jsonb data type supports array-style subscripting expressions + to extract or update particular elements. It's possible to use multiple + subscripting expressions to extract nested values. In this case, a chain of + subscripting expressions follows the same rules as the + path argument in jsonb_set function, + e.g. in case of arrays it is a 0-based operation or that negative integers + that appear in path count from the end of JSON arrays. + The result of subscripting expressions is always jsonb data type. An + example of subscripting syntax: + +-- Extract value by key +SELECT ('{"a": 1}'::jsonb)['a']; + +-- Extract nested value by key path +SELECT ('{"a": {"b": {"c": 1}}}'::jsonb)['a']['b']['c']; + +-- Extract element by index +SELECT ('[1, "2", null]'::jsonb)[1]; + +-- Update value by key, note the single quotes - the assigned value +-- needs to be of jsonb type as well +UPDATE table_name SET jsonb_field['key'] = '1'; + +-- Select records using where clause with subscripting. 
Since the result of +-- subscripting is jsonb and we basically want to compare two jsonb objects, we +-- need to put the value in double quotes to be able to convert it to jsonb. +SELECT * FROM table_name WHERE jsonb_field['key'] = '"value"'; + + + Subscripting for jsonb does not support slice expressions, + even if it contains an array. + + In case if source jsonb is NULL, assignment + via subscripting will proceed as if it was an empty JSON object: + +-- If jsonb_field here is NULL, the result is {"a": 1} +UPDATE table_name SET jsonb_field['a'] = '1'; + +-- If jsonb_field here is NULL, the result is [1] +UPDATE table_name SET jsonb_field[0] = '1'; + + + + + Transforms diff --git a/src/backend/utils/adt/Makefile b/src/backend/utils/adt/Makefile index 82732146d3..279ff15ade 100644 --- a/src/backend/utils/adt/Makefile +++ b/src/backend/utils/adt/Makefile @@ -50,6 +50,7 @@ OBJS = \ jsonb_op.o \ jsonb_util.o \ jsonfuncs.o \ + jsonbsubs.o \ jsonpath.o \ jsonpath_exec.o \ jsonpath_gram.o \ diff --git a/src/backend/utils/adt/jsonb_util.c b/src/backend/utils/adt/jsonb_util.c index 4eeffa1424..41a1c1f9bb 100644 --- a
RE: [Patch] Optimize dropping of relation buffers using dlist
On Wednesday, December 30, 2020 8:58 PM, Amit Kapila wrote:
> On Wed, Dec 30, 2020 at 11:28 AM Tang, Haiying wrote:
>>
>> Hi Amit,
>>
>> In my last mail
>> (https://www.postgresql.org/message-id/66851e198f6b41eda59e6257182564b6%40G08CNEXMBPEKD05.g08.fujitsu.local),
>> I've sent you the performance test results (run only 1 time) on a single
>> table. Here are my retested results (averaged over 15 runs), which I think
>> are more accurate.
>>
>> In terms of 20G and 100G, the optimization on 100G is linear, but 20G is
>> nonlinear (also including test results on shared buffers of 50G/60G), so
>> it's a little difficult for me to decide the threshold from the two. If we
>> just consider 100G, I think NBuffers/32 is the optimized max relation size.
>> But I don't know how to judge for 20G. If you have any suggestion, kindly
>> let me know.
>
> Considering these results, NBuffers/64 seems a good threshold as beyond
> that there is no big advantage. BTW, it is not clear why the advantage for
> a single table is not as big as for multiple tables with the Truncate
> command. Can you share your exact test steps for any one of the tests?
> Also, did you set autovacuum = off for these tests? If not, then the
> results might not be reliable, because before you run the test via the
> Vacuum command autovacuum would have done that work.

Happy new year. The V38 patch LGTM.

Apologies for a bit of delay in posting the test results, but since it's the
start of the commitfest, here they are, and the results were interesting.

I executed a VACUUM test using the same approach that Tsunakawa-san did in
[1], but this time measuring only the total time that DropRelFileNodeBuffers()
took. I used only a single relation, and tried various sizes using threshold
values from NBuffers/512 to NBuffers/1, as advised by Amit.

Example of relation sizes for NBuffers/512:
  100GB shared_buffers: 200 MB
  20GB shared_buffers:  40 MB
  1GB shared_buffers:   2 MB
  128MB shared_buffers: 0.25 MB

The regression, which means the patch performs worse than master, only
happens for relation sizes NBuffers/2 and below, for all shared_buffers
settings. The fastest performance on a single relation was with relation
size NBuffers/512.
[VACUUM Recovery Performance on Single Relation]
Legend: P_XXX (Patch, NBuffers/XXX relation size), M_XXX (Master, NBuffers/XXX relation size)
Unit: seconds

| Rel Size | 100 GB s_b | 20 GB s_b  | 1 GB s_b   | 128 MB s_b |
|----------|------------|------------|------------|------------|
| P_512    | 0.012594   | 0.001989   | 0.81       | 0.12       |
| M_512    | 0.208757   | 0.046212   | 0.002013   | 0.000295   |
| P_256    | 0.026311   | 0.004416   | 0.000129   | 0.21       |
| M_256    | 0.241017   | 0.047234   | 0.002363   | 0.000298   |
| P_128    | 0.044684   | 0.009784   | 0.000290   | 0.42       |
| M_128    | 0.253588   | 0.047952   | 0.002454   | 0.000319   |
| P_64     | 0.065806   | 0.017444   | 0.000521   | 0.75       |
| M_64     | 0.268311   | 0.050361   | 0.002730   | 0.000339   |
| P_32     | 0.121441   | 0.033431   | 0.001646   | 0.000112   |
| M_32     | 0.285254   | 0.061486   | 0.003640   | 0.000364   |
| P_16     | 0.255503   | 0.065492   | 0.001663   | 0.000144   |
| M_16     | 0.377013   | 0.081613   | 0.003731   | 0.000372   |
| P_8      | 0.560616   | 0.109509   | 0.005954   | 0.000465   |
| M_8      | 0.692596   | 0.112178   | 0.006667   | 0.000553   |
| P_4      | 1.109437   | 0.162924   | 0.011229   | 0.000861   |
| M_4      | 1.162125   | 0.178764   | 0.011635   | 0.000935   |
| P_2      | 2.202231   | 0.317832   | 0.020783   | 0.002646   |
| M_2      | 1.583959   | 0.306269   | 0.015705   | 0.002021   |
| P_1      | 3.080032   | 0.632747   | 0.032183   | 0.002660   |
| M_1      | 2.705485   | 0.543970   | 0.030658   | 0.001941   |

%reg = (Patched - Master)/Patched

| %reg_relsize | 100 GB s_b | 20 GB s_b  | 1 GB s_b   | 128 MB s_b |
|--------------|------------|------------|------------|------------|
| %reg_512     | -1557.587% | -2223.006% | -2385.185% | -2354.167% |
| %reg_256     | -816.041%  | -969.691%  | -1731.783% | -1319.048% |
| %reg_128     | -467.514%  | -390.123%  | -747.008%  | -658.333%  |
| %reg_64      | -307.727%  | -188.704%  | -423.992%  | -352.000%  |
| %reg_32      | -134.891%  | -83.920%   | -121.097%  | -225.970%  |
| %reg_16      | -47.557%   | -24.614%   | -124.279%  | -157.390%  |
| %reg_8       | -23.542%   | -2.437%    | -11.967%   | -19.010%   |
| %reg_4       | -4.749%    | -9.722%    | -3.608%    | -8.595%    |
| %reg_2       | 28.075%    | 3.638%     | 24.436%    | 23.615%    |
| %reg_1       | 12.160%    | 14.030%    | 4.739%     | 27.010%    |

Since our goal is to find the approximate threshold where the cost of finding
the to-be-invalidated buffers gets higher in the optimized path than in the
traditional path:

A. Traditional Path
1. For each shared buffer, compare the relfilenode.
2. LockBufHdr()
3. Compare block number, InvalidateBuffers() if it'
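For reference, a minimal sketch of the optimized invalidation path that these numbers measure against the traditional full scan of shared buffers; this is purely illustrative and is not the code from the patch, and the threshold macro name is invented here:

/*
 * Hypothetical sketch, assuming the context of bufmgr.c internals
 * (InvalidateBuffer is a static function there).
 */
#include "postgres.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"

#define BUF_DROP_FULL_SCAN_THRESHOLD (NBuffers / 256)	/* invented name */

static void
DropRelationBuffersOptimized(RelFileNode rnode, ForkNumber forkNum,
							 BlockNumber nblocks)
{
	BlockNumber blk;

	/* Only worthwhile when nblocks < BUF_DROP_FULL_SCAN_THRESHOLD. */
	for (blk = 0; blk < nblocks; blk++)
	{
		BufferTag	tag;
		uint32		hash;
		LWLock	   *partitionLock;
		int			buf_id;

		/* Probe the buffer mapping table for this exact block. */
		INIT_BUFFERTAG(tag, rnode, forkNum, blk);
		hash = BufTableHashCode(&tag);
		partitionLock = BufMappingPartitionLock(hash);

		LWLockAcquire(partitionLock, LW_SHARED);
		buf_id = BufTableLookup(&tag, hash);
		LWLockRelease(partitionLock);

		if (buf_id >= 0)
			InvalidateBuffer(GetBufferDescriptor(buf_id));	/* rechecks the tag */
	}
}

The trade-off these benchmarks quantify is exactly where this per-block probing stops being cheaper than one pass over all of shared_buffers.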
Re: Move --data-checksums to common options in initdb --help
Greetings,

* Michael Paquier (mich...@paquier.xyz) wrote:
> On Fri, Jan 01, 2021 at 08:34:34PM +0100, Michael Banck wrote:
>> I think enough people use data checksums these days that it warrants to
>> be moved into the "normal part", like in the attached.
>
> +1.  Let's see first what others think about this change.

I agree with this, but I'd also like to propose, again, as has been
discussed a few times, making it the default too.

Thanks,

Stephen
Why latestRemovedXid|cuteoff_xid are always sent?
Hello, hackers.

Working on some stuff, I realized I do not understand why
latestRemovedXid|cutoff_xid (in different types of WAL records) are sent
every time they appear on the primary side.

latestRemovedXid|cutoff_xid is used to call
ResolveRecoveryConflictWithSnapshot and cancel conflicting backends on the
standby. In some cases, snapshot conflict resolution is the only work REDO
does (heap_xlog_cleanup_info or btree_xlog_reuse_page, for example).

Could we try to somehow optimistically advance the latest sent
latestRemovedXid value in shared memory on the primary and skip sending it
if a newer xid was sent already? In such a way we could reduce the number of
ResolveRecoveryConflictWithSnapshot calls on the standby and even skip some
WAL records.

At least we could do the same optimization on the standby side (skipping
ResolveRecoveryConflictWithSnapshot if it was already called with a newer
xid).

Is this a sane idea, or have I missed something huge?

Thanks,
Michail.
Re: archive status ".ready" files may be created too early
Hi!

I was looking to review something in the CF, and this seems like a thread of
some interest to me. Recently we had a somewhat related incident. Do I
understand correctly that this incident is related to the bug discussed in
this thread?

The primary instance was killed by OOM:

[ 2020-11-12 15:27:03.732 MSK ,,,739,0 ]:LOG: server process (PID 40189) was terminated by signal 9: Killed

After recovery it archived some WAL segments:

[ 2020-11-12 15:27:31.477 MSK ,,,739,0 ]:LOG: database system is ready to accept connections
INFO: 2020/11/12 15:27:32.059541 FILE PATH: 000E0001C02F00AF.br
INFO: 2020/11/12 15:27:32.114319 FILE PATH: 000E0001C02F00B3.br

Then PITR failed on another host:

[ 2020-11-12 16:26:33.024 MSK ,,,51414,0 ]:LOG: restored log file "000E0001C02F00B3" from archive
[ 2020-11-12 16:26:33.042 MSK ,,,51414,0 ]:LOG: invalid record length at 1C02F/B3FFF778: wanted 24, got 0
[ 2020-11-12 16:26:33.042 MSK ,,,51414,0 ]:LOG: invalid record length at 1C02F/B3FFF778: wanted 24, got 0

The archived segment has some zeroes at the end:

rmgr: XLOG        len (rec/tot):     51/  1634, tx:          0, lsn: 1C02F/B3FFF058, prev 1C02F/B3FFEFE8, desc: FPI_FOR_HINT , blkref #0: rel 1663/14030/16384 blk 140 FPW
rmgr: Heap        len (rec/tot):    129/   129, tx: 3890578935, lsn: 1C02F/B3FFF6C0, prev 1C02F/B3FFF058, desc: HOT_UPDATE off 34 xmax 3890578935 ; new off 35 xmax 0, blkref #0: rel 1663/14030/16384 blk 140
rmgr: Transaction len (rec/tot):     46/    46, tx: 3890578935, lsn: 1C02F/B3FFF748, prev 1C02F/B3FFF6C0, desc: COMMIT 2020-11-12 15:27:31.507363 MSK
pg_waldump: FATAL: error in WAL record at 1C02F/**B3FFF748**: invalid record length at 1C02F/**B3FFF778**: wanted 24, got 0

Meanwhile the next segment points to the previous record at **B3FFF748**:

postgres@man-odszl7u4361o8m3z:/tmp$ pg_waldump 000E0001C02F00B4 | head
rmgr: Heap        len (rec/tot):    129/   129, tx: 3890578936, lsn: 1C02F/B4000A68, prev 1C02F/**B3FFF778**, desc: HOT_UPDATE off 35 xmax 3890578936 ; new off 36 xmax 0, blkref #0: rel 1663/14030/16384 blk 140
rmgr: Transaction len (rec/tot):     46/    46, tx: 3890578936, lsn: 1C02F/B4000AF0, prev 1C02F/B4000A68, desc: COMMIT 2020-11-12 15:27:32.509443 MSK

Best regards,
Andrey Borodin.
Re: Reloptions for table access methods
On Wed, 2020-12-30 at 21:23 +, Simon Riggs wrote:
> But you cannot seriously argue that a one line patch that changes
> user visible behavior can be acceptable in Postgres core without tests
> or docs or code comments.

Hi Simon,

Often, well-documented APIs come about after a few extensions gain experience
hacking things together using undocumented assumptions and internal APIs. But
in this case, extension authors have no opportunity to hack in reloptions for
a TableAM, because of the needless extra check that my patch would eliminate.

The patch does not have any user-visible impact. To have any effect, an
extension would need to use these internal APIs in a specific way that is not
documented externally.

I see my tiny patch as a tiny improvement to the backend code because it
eliminates a useless extra check, and therefore it doesn't need much
justification. If others see it as a burden on the engine code, that's a
different story.

If we insist that this must be a fully-supported feature or nothing at all,
then I'll withdraw the patch. But I doubt that will result in a proper
feature for v14; it just means that when we do get around to supporting
extensible reloptions, there will be no hacks in the wild to learn from.

Regards,
Jeff Davis
Re: pgbench: option delaying queries till connections establishment?
> It looks like macOS doesn't have pthread barriers (via cfbot 2021, now
> with more operating systems):

Indeed :-( I'll look into that.

> Just for fun, the attached 0002 patch is a quick prototype of a
> replacement for that stuff that seems to work OK on a Mac here. (I'm not
> sure if the Windows part makes sense or works.)

Thanks! That will definitely help because I do not have a Mac. I'll do some
cleanup.

-- 
Fabien.
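For readers curious what such a replacement typically looks like, here is a minimal, self-contained sketch of a pthread_barrier_t substitute built from a mutex and a condition variable; this only illustrates the general technique and is not claimed to be what the attached 0002 patch does:

#include <pthread.h>

typedef struct
{
	pthread_mutex_t mutex;
	pthread_cond_t	cond;
	int				count;		/* threads that must arrive */
	int				arrived;	/* threads arrived in this cycle */
	int				cycle;		/* generation counter to handle reuse */
} barrier_t;

static int
barrier_init(barrier_t *b, int count)
{
	b->count = count;
	b->arrived = 0;
	b->cycle = 0;
	pthread_mutex_init(&b->mutex, NULL);
	pthread_cond_init(&b->cond, NULL);
	return 0;
}

/* Returns 1 for exactly one "serial" thread per cycle, 0 for the others. */
static int
barrier_wait(barrier_t *b)
{
	int			serial = 0;

	pthread_mutex_lock(&b->mutex);
	if (++b->arrived == b->count)
	{
		/* last thread in: start a new cycle and wake everyone */
		b->arrived = 0;
		b->cycle++;
		serial = 1;
		pthread_cond_broadcast(&b->cond);
	}
	else
	{
		int			cycle = b->cycle;

		/* wait for the cycle counter to advance; guards against spurious wakeups */
		while (cycle == b->cycle)
			pthread_cond_wait(&b->cond, &b->mutex);
	}
	pthread_mutex_unlock(&b->mutex);
	return serial;
}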
Re: WIP: BRIN multi-range indexes
Hi,

On 1/2/21 7:42 AM, Thomas Munro wrote:
> On Sun, Dec 20, 2020 at 1:16 PM Tomas Vondra wrote:
>> Attached is an updated version of the patch series, rebased on current
>> master, and results for benchmark comparing the various bloom variants.
>
> Perhaps src/include/utils/inet.h needs to include <sys/socket.h>,
> because FreeBSD says:
>
> brin_minmax_multi.c:1693:24: error: use of undeclared identifier 'AF_INET'
>         if (ip_family(ipa) == PGSQL_AF_INET)
>                               ^
> ../../../../src/include/utils/inet.h:39:24: note: expanded from macro 'PGSQL_AF_INET'
> #define PGSQL_AF_INET   (AF_INET + 0)
>                          ^

Not sure. The other files using PGSQL_AF_INET just include sys/socket.h
directly, so maybe this should just do the same thing ...

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: Why latestRemovedXid|cuteoff_xid are always sent?
On Sat, Jan 2, 2021 at 8:00 AM Michail Nikolaev wrote:
> Working on some stuff, I realized I do not understand why
> latestRemovedXid|cutoff_xid (in different types of WAL records) are
> sent every time they appear on the primary side.
>
> latestRemovedXid|cutoff_xid is used to call
> ResolveRecoveryConflictWithSnapshot and cancel conflicting backends on
> the standby. In some cases, snapshot conflict resolution is the only
> work REDO does (heap_xlog_cleanup_info or btree_xlog_reuse_page, for
> example).

But you can say the same thing about fields from certain WAL record types,
too. It's not that uncommon for code to make a conceptually optional piece
of information into a normal WAL record struct field, even though that
approach has unnecessary space overhead in the cases that don't need the
information. Often this makes hardly any difference due to factors like
alignment and the simple fact that we don't expect very many WAL records
(with or without the optional information) in practice.

Of course, it's possible that the question of whether or not it's worth it
has been misjudged for any given case. And maybe these particular WAL
records are one such case where somebody got it wrong, affecting a real
workload (I am ignoring the complexity of making it work for
latestRemovedXid in particular for now). But I tend to doubt that the space
saving would be noticeable, from what I've seen with pg_waldump.

> Could we try to somehow optimistically advance the latest sent
> latestRemovedXid value in shared memory on the primary and skip
> sending it if a newer xid was sent already? In such a way we could
> reduce the number of ResolveRecoveryConflictWithSnapshot calls on the
> standby and even skip some WAL records.
>
> At least we could do the same optimization on the standby side
> (skipping ResolveRecoveryConflictWithSnapshot if it was already called
> with a newer xid).

Maybe it makes sense to add a fast path to
ResolveRecoveryConflictWithSnapshot() so that it falls out early without
scanning the proc array (in cases where it will still do so today, since of
course ResolveRecoveryConflictWithSnapshot() has the obvious
InvalidTransactionId fast path already).

> Is this a sane idea, or have I missed something huge?

It seems like two almost distinct ideas to me. Though the important thing in
both cases is the savings in real world conditions.

-- 
Peter Geoghegan
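As a hedged illustration of that standby-side fast path (not actual PostgreSQL code; the function name and static variable here are invented), the idea could look roughly like this:

/*
 * Sketch only: remember the newest xid we have already resolved conflicts
 * for, and return early if a later WAL record carries an older value.
 */
#include "postgres.h"
#include "access/transam.h"
#include "storage/relfilenode.h"

static TransactionId lastResolvedXid = InvalidTransactionId;

void
ResolveRecoveryConflictWithSnapshotFastPath(TransactionId latestRemovedXid,
											RelFileNode node)
{
	/* existing fast path: nothing to do for an invalid xid */
	if (!TransactionIdIsValid(latestRemovedXid))
		return;

	/* proposed fast path: this cutoff is already covered by earlier work */
	if (TransactionIdIsValid(lastResolvedXid) &&
		TransactionIdPrecedesOrEquals(latestRemovedXid, lastResolvedXid))
		return;

	lastResolvedXid = latestRemovedXid;

	/* ... fall through to the existing proc-array scan and conflict handling ... */
}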
Re: doc review for v14
On Tue, Dec 29, 2020 at 06:22:43PM +0900, Michael Paquier wrote:
> Yes, I have done an extra effort on those fixes where needed.  On top
> of that, I have included catalogs.sgml, pgstatstatements.sgml,
> explain.sgml, pg_verifybackup.sgml and wal.sgml in 13.

Justin, I got to look at the libxml2 part, and finished by rewording
the comment block as follows:

+    * XmlTable returns a table-set of composite values.  This error context
+    * is used for providing more details, and needs to be reset between two
+    * internal calls of libxml2 as different error contexts might have been
+    * created or used.

What do you think?
--
Michael
Re: doc review for v14
On Sun, Jan 03, 2021 at 03:10:54PM +0900, Michael Paquier wrote:
> On Tue, Dec 29, 2020 at 06:22:43PM +0900, Michael Paquier wrote:
>> Yes, I have done an extra effort on those fixes where needed.  On top
>> of that, I have included catalogs.sgml, pgstatstatements.sgml,
>> explain.sgml, pg_verifybackup.sgml and wal.sgml in 13.
>
> Justin, I got to look at the libxml2 part, and finished by rewording
> the comment block as follows:
>
> +    * XmlTable returns a table-set of composite values.  This error context
> +    * is used for providing more details, and needs to be reset between two
> +    * internal calls of libxml2 as different error contexts might have been
> +    * created or used.

I don't like "this error context", since "this" seems to be referring to the
"table-set of composite values" as an error context.  I guess you mean:
"needs to be reset between each internal call to libxml2...".  So I'd suggest:

+    * XmlTable returns a table-set of composite values.  The error context
+    * is used for providing additional detail.  It needs to be reset between
+    * each call to libxml2, since different error contexts might have been
+    * created or used since it was last set.

But actually, maybe we should just use the comment that exists everywhere
else for that:

    /* Propagate context related error context to libxml2 */
    xmlSetStructuredErrorFunc((void *) xtCxt->xmlerrcxt, xml_errorHandler);

Maybe we should elaborate and say:

    /*
     * Propagate context related error context to libxml2 (needs to be
     * reset before each call, in case other error contexts have been
     * assigned since it was first set).
     */
    xmlSetStructuredErrorFunc((void *) xtCxt->xmlerrcxt, xml_errorHandler);

-- 
Justin