[PATCH] Faster, easier valgrind runs with make USE_VALGRIND=1 check
Hi all,

The attached patch adds support for running any temp-install based tests (check, isolationcheck, src/test/recovery, etc.) under the control of valgrind with a simple

    make USE_VALGRIND=1 check

It's based on a script I've been using for some time to run faster, simpler Valgrind checks on the codebase with much less irrelevant noise than the usual approaches.

There are no C code changes at all in this patch; it only touches Makefile.global and adds a new src/tools/valgrind_wrapper tool.

When you specify USE_VALGRIND=1, the PATH used by $(with_temp_install) is prefixed with a tmp_install/bin_valgrind_wrapper/ directory. Each binary in $(bindir) gets a corresponding wrapper script in bin_valgrind_wrapper in the temp install. The wrappers intercept execution of every command in the bindir and exec them under the control of valgrind (or skip valgrind and exec the target directly, if desired).

This has many advantages over the usual approaches of an installcheck-based valgrind run or "valgrind make check":

* There's no custom setup; it works out of the box.

* It produces much less irrelevant log output and runs a lot faster, because it only runs postgres-related binaries under valgrind, not irrelevant noise like perl interpreters, make, shell scripts, etc.

* It's much more targeted and selective - if you're only interested in some extension or new backend feature, you can trivially set it to target just the backend, skip checking of initdb, and skip checking of psql, so you get more relevant log output and faster runs.

I'll follow up to this post with some timing and log output numbers, but wanted to share what I had first.

-DUSE_VALGRIND is also added to CFLAGS at compile time when USE_VALGRIND=1 is passed to make. This gets rid of the need to hack pg_config_manual.h or fiddle with configure re-runs when you want to build with valgrind support. Arguably it'd be better to add a --enable-valgrind option to configure. LMK if that's preferable.

Note that there's a bit of a hack in the wrapper script to work around Valgrind's inability to set the argv[0] of a process run under valgrind to anything other than the exact command name to be executed. I have a patch for valgrind pending to add that capability (like "exec -a" in bash), but a workaround is necessary for now. It's made a bit more complicated by PostgreSQL's determination to canonicalize paths and follow symlinks in find_my_exec(). The script's hardlink-based workarounds for this could be removed if we could agree to support a debug env-var or command-line option that could be used to supply an override path to be returned by find_my_exec() instead of performing normal discovery. If you'd prefer that approach to the current workaround in the script, let me know.

I'm also willing to add valgrind-support-detection logic that will cause valgrind launched via "make USE_VALGRIND=1" to refuse to run if it detects that the target postgres was not built with -DUSE_VALGRIND for proper instrumentation. This can be done with the valgrind --require-text-symbol option and a dummy export symbol added to the symbol table only when compiled with -DUSE_VALGRIND. If that's desirable let me know; it should be quick to add.

You can find more detail in the patch commit message (attached) and in the src/test/valgrind_wrapper script it adds. If you're wondering why the valgrind options --trace-children=yes --trace-children-skip=pattern --trace-children-skip-by-arg=pattern don't solve this problem, read the script's comments.
--
Craig Ringer
http://www.2ndQuadrant.com/
2ndQuadrant - PostgreSQL Solutions for the Enterprise

From e730e862a0731dc3d2786c5004a146aff7dd6bf7 Mon Sep 17 00:00:00 2001
From: Craig Ringer
Date: Tue, 4 May 2021 10:34:01 +0800
Subject: [PATCH v3] Make Valgrind runs simpler with make USE_VALGRIND=1

To run Valgrind based tests on postgres, it's generally been necessary to either:

* Initdb and start your own cluster under the control of valgrind, then use "make installcheck". This won't work with TAP tests, and it's cumbersome to set up and run.

-or-

* Run "make check" under the control of valgrind. This approach runs all the uninteresting processes under valgrind, with all the associated overhead. Every make, shell, utility command, perl interpreter, etc gets run under valgrind, which slows everything down a lot and produces a great deal of uninteresting valgrind output. There's no practical way to target just the server backend this way.

Provide a faster, simpler and more targeted way to run valgrind on postgres by adding support for a "USE_VALGRIND=1" parameter to the makefiles. This has two main effects:

* It adds -DUSE_VALGRIND to CFLAGS, so enhanced valgrind runtime support is compiled into postgres; and

* It interposes a wrapper script before all executions of any binary installed in the $(bindir) of a temp-install. The wrapper script invokes the real binary under valgrind. The valgrind i
Re: Small issues with CREATE TABLE COMPRESSION
On Wed, May 5, 2021 at 11:02 AM Dilip Kumar wrote: > > On Wed, May 5, 2021 at 12:06 AM Robert Haas wrote: > > > > > > There are no tests in pg_dump to make sure that some ALTER > > > MATERIALIZED VIEW or ALTER TABLE commands are generated when the > > > compression of a matview's or table's column is changed. > > > > True, but it does seem to work. I am happy if you or anyone want to > > write some tests. > > I think it will be really hard to generate such a test in pg_dump, > because by default we are compiling --without-lz4, which means we have > only one compression option available, and if there is only one option > available then the column compression method and the default > compression method will be the same, so the dump will not generate an extra > command of type ALTER TABLE... SET COMPRESSION.

I think we already have such test cases, at least through pg_upgrade. Basically, if you look at compression.sql, we are not dropping the tables, so that pg_upgrade can dump and test them. So if the tests are run --with-lz4, then in the pg_upgrade dump we can see ALTER TABLE... SET COMPRESSION type of commands.

-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
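To make that concrete, here is a minimal sketch of the kind of setup compression.sql leaves behind for the pg_upgrade test to dump; the table name is illustrative, and the lz4 column only exists in --with-lz4 builds:

    -- Created in the regression run and deliberately not dropped, so the
    -- pg_upgrade test dumps and restores it afterwards.
    CREATE TABLE cmdata_kept (f1 text COMPRESSION lz4);

    -- In a --with-lz4 build, dumping this table is expected to emit something like:
    --   ALTER TABLE ONLY public.cmdata_kept ALTER COLUMN f1 SET COMPRESSION lz4;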
RE: Use simplehash.h instead of dynahash in SMgr
Hi David, Alvaro, -hackers

> Hi David,
>
> You're probably aware of this, but just to make it explicit: Jakub Wartak was
> testing performance of recovery, and one of the bottlenecks he found in
> some of his cases was dynahash as used by SMgr. It seems quite possible
> that this work would benefit some of his test workloads.

I might be a little bit out of the loop, but as Alvaro stated - Thomas did plenty of excellent work related to recovery performance in that thread. In my humble opinion, and if I'm not mistaken (I'm speculating here), it might be *not* how the Smgr hash works, but how often it is being exercised, and that would also explain the relatively lower than expected(?) gains here. There are at least two very important emails from him that I'm aware of that touch the topic of reordering/compacting/batching calls to Smgr:

https://www.postgresql.org/message-id/CA%2BhUKG%2B2Vw3UAVNJSfz5_zhRcHUWEBDrpB7pyQ85Yroep0AKbw%40mail.gmail.com
https://www.postgresql.org/message-id/flat/CA%2BhUKGK4StQ%3DeXGZ-5hTdYCmSfJ37yzLp9yW9U5uH6526H%2BUeg%40mail.gmail.com

Another potential option that we've discussed is that the redo generation itself is likely a brake on efficient recovery performance today (e.g. INSERT-SELECT on a table with indexes generates interleaved WAL records that touch an often limited set of blocks, which usually puts Smgr into the spotlight).

-Jakub Wartak.
Re: few ideas for pgbench
Hello Pavel,

This is not a simple question. Personally I prefer to show this info every time, although it can be redundant. Just for check and for more simple automatic processing. When I run pgbench, I usually work with more releases together, so the server version is important info.

Ok. Yes.

2. can we generate some output in structured format - XML, JSON ?

It is obviously possible, but that would mean some code. ISTM that the various outputs are easy enough to parse and convert to anything without needing a special format? Is there some particular part you have in mind?

I thought about something that I can simply import to Postgres or to R. But maybe XML or JSON is a bad idea. What about CSV? Any run can produce one row.

Yep, CSV is simple and nice. It depends on what information you would like. For instance, progress reports (-P 1) or logs/sampling (-l) would be relevant candidates for CSV. Not so much for the final report, though.

-- Fabien.
Re: few ideas for pgbench
On Wed, 5 May 2021 at 11:55, Fabien COELHO wrote: > > Hello Pavel, > > > This is not a simple question. Personally I prefer to show this info > every > > time, although it can be redundant. Just for check and for more simple > > automatic processing. > > > > When I run pgbench, I usually work with more releases together, so the > > server version is important info. > > Ok. Yes. > > >>> 2. can ve generate some output in structured format - XML, JSON ? > >> > >> It is obviously possible, but that would mean some code. ISTM that the > >> various outputs are easy enough to parse and convert to anything without > >> needing a special format? Is there some particular part you have in > mind? > >> > > > > I thought about something what I can simply import to Postgres or to R. > > But maybe XML or JSON is a bad idea. > > > > What about CSV? Any run can produce one row. > > Yep, CSV is simple and nice. It depends on what information you would > like. For instance for progress report (-P 1) or logs/sampling (-l) would > be relevant candidates for CSV. Not so much for the final report, though. >

I think there can be almost all information. We have to ensure consistency of columns. The basic usage can be:

for ...; do pgbench ... >> logfile; done

and the log file can look like: start time, rowno, serverver, clientver, connections, scale, readonly, jobs, tps, latency, ... The header row can be optional.

> > -- > Fabien. >
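To illustrate the import side mentioned above, here is a hedged sketch of loading such a one-row-per-run CSV log into Postgres; the table and column names are only an assumption based on the columns listed above, and the file path is illustrative:

    -- Hypothetical schema matching the proposed per-run CSV columns.
    CREATE TABLE pgbench_runs (
        start_time  timestamptz,
        rowno       int,
        serverver   text,
        clientver   text,
        connections int,
        scale       int,
        readonly    boolean,
        jobs        int,
        tps         numeric,
        latency_ms  numeric
    );

    -- Load the pgbench CSV log; HEADER handles the optional header row.
    COPY pgbench_runs FROM '/tmp/pgbench_runs.csv' WITH (FORMAT csv, HEADER true);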
Re: Small issues with CREATE TABLE COMPRESSION
On Wed, May 05, 2021 at 01:41:03PM +0530, Dilip Kumar wrote: > I think we already have such test cases at least through pg_upgrade. > Basically, if you see in compression.sql we are not dropping the table > so that pg_upgrade and dump them and test. So if test run --with-lz4 > then in pg_upgrade dump we can see ALTER TABLE... SET COMPRESSION type > of commands. The TAP tests of pg_dump are much more picky than what pg_upgrade is able to do. With the existing set of tests in place, what you are able to detect is that pg_upgrade does not *break* if there are tables with attributes using various compression types, but that would not be enough to make sure that the correct compression method is *applied* depending on the context expected (default_toast_compression + the attribute compression + pg_dump options), which is what the TAP tests of pg_dump are able to correctly detect if extended in an appropriate way. With what's on HEAD, we would easily miss any bugs introduced in pg_dump that change the set of commands generated depending on the options given by a user, but still allow pg_upgrade to work correctly. For example, there could be issues where we finish by setting up the incorrect compression option, with pg_upgrade happily finishing. There is a gap in the test coverage here. -- Michael signature.asc Description: PGP signature
RFC: Detailed reorder buffer stats dumps
Hi all

I'm thinking of piggy-backing on the approach used in the "Get memory contexts of an arbitrary backend process" patch in order to provide access to detailed reorder buffer content statistics from walsenders on request.

Right now the reorder buffer is mostly a black box. I mostly rely on gdb or on dynamic probes (perf, systemtap) to see what it's doing. I intend to post a patch soon to add a couple of fields to struct WalSnd to report some very coarse reorder buffer stats - at least oldest buffered xid, number of buffered txns, total bytes of buffered txns in memory, and total bytes of buffered txns spilled to disk.

But sometimes what I really want is details on the txns that're in the reorder buffer, and that's not feasible to export via always-enabled reporting like struct WalSnd. So I'm thinking that the same approach used for the memory context stats patch might work well for asking the walsender for a detailed dump of reorder buffer contents. Something like a per-buffered-txn table of txn topxid, start-lsn, most recent change lsn, number of changes, number of subxids, number of invalidations, number of catalog changes, buffer size in memory, and buffer size spilled to disk.

Anyone drastically opposed to the idea?

(I know I have to finish up with the LWLock tracepoint patchset first, this is a RFC at this stage).

--
Craig Ringer
http://www.2ndQuadrant.com/
2ndQuadrant - PostgreSQL Solutions for the Enterprise
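As a strawman for what the request interface could look like, here's a purely hypothetical sketch mirroring pg_log_backend_memory_contexts() from the memory-contexts patch; the function name and its behaviour are assumptions, not anything implemented:

    -- Hypothetical: ask the walsender with the given PID to dump a per-buffered-txn
    -- summary (topxid, start LSN, change counts, in-memory/spilled sizes, ...) to the
    -- server log, the same way pg_log_backend_memory_contexts() dumps memory stats.
    SELECT pg_log_walsender_reorder_buffer(pid)
    FROM pg_stat_replication
    WHERE application_name = 'my_subscription';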
Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY
Hello,

I found this in the documentation, section '5.11.3. Partitioning Using Inheritance'[1]:

"Some operations require a stronger lock when using declarative partitioning than when using table inheritance. For example, removing a partition from a partitioned table requires taking an ACCESS EXCLUSIVE lock on the parent table, whereas a SHARE UPDATE EXCLUSIVE lock is enough in the case of regular inheritance."

This point is no longer valid, with some restrictions. If the table has a default partition, then removing a partition still requires taking an ACCESS EXCLUSIVE lock.

Maybe it makes sense to add some details about DETACH CONCURRENTLY to the section '5.11.2.2. Partition Maintenance' and completely remove this point?

1. https://www.postgresql.org/docs/devel/ddl-partitioning.html#DDL-PARTITIONING-USING-INHERITANCE

Pavel Luzanov
Postgres Professional: https://postgrespro.com
The Russian Postgres Company
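For reference, the command in question looks like the following (partition names are illustrative); per the restriction above, the concurrent form only needs the weaker lock, while plain DETACH is still required when a default partition exists:

    -- Takes SHARE UPDATE EXCLUSIVE on the partitioned table rather than
    -- ACCESS EXCLUSIVE; not usable if the table has a default partition.
    ALTER TABLE measurement DETACH PARTITION measurement_y2020 CONCURRENTLY;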
Re: Small issues with CREATE TABLE COMPRESSION
On Wed, May 5, 2021 at 4:00 PM Michael Paquier wrote: > > On Wed, May 05, 2021 at 01:41:03PM +0530, Dilip Kumar wrote: > > I think we already have such test cases at least through pg_upgrade. > > Basically, if you see in compression.sql we are not dropping the table > > so that pg_upgrade and dump them and test. So if test run --with-lz4 > > then in pg_upgrade dump we can see ALTER TABLE... SET COMPRESSION type > > of commands. > > The TAP tests of pg_dump are much more picky than what pg_upgrade is > able to do. With the existing set of tests in place, what you are > able to detect is that pg_upgrade does not *break* if there are tables > with attributes using various compression types, but that would not be > enough to make sure that the correct compression method is *applied* > depending on the context expected (default_toast_compression + the > attribute compression + pg_dump options), which is what the TAP tests > of pg_dump are able to correctly detect if extended in an appropriate > way.

Okay, got your point.

> With what's on HEAD, we would easily miss any bugs introduced in > pg_dump that change the set of commands generated depending on the > options given by a user, but still allow pg_upgrade to work correctly. > For example, there could be issues where we finish by setting up the > incorrect compression option, with pg_upgrade happily finishing. > There is a gap in the test coverage here.

Basically, the problem is that the default compilation is --without-lz4, so by default there is only one compression method, and with only one compression method we cannot generate the test case you asked for, because that will be the default compression method and we don't dump the default compression method. So basically, if we have to write this test case in pg_dump then we will have to use lz4, which means it will generate different output --with-lz4 vs --without-lz4. With a simple regress test it is easy to deal with such cases by keeping multiple .out files, but I am not sure whether we can do this easily with the pg_dump tests without adding much complexity.

-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: Use simplehash.h instead of dynahash in SMgr
Hi Jakub,

On Wed, 5 May 2021 at 20:16, Jakub Wartak wrote: > I might be a little bit out of the loop, but as Alvaro stated - Thomas did > plenty of excellent job related to recovery performance in that thread. In my > humble opinion and if I'm not mistaken (I'm speculating here) it might be > *not* how Smgr hash works, but how often it is being exercised and that would > also explain relatively lower than expected(?) gains here. There are at least > two very important emails from him that I'm aware that are touching the topic > of reordering/compacting/batching calls to Smgr: > https://www.postgresql.org/message-id/CA%2BhUKG%2B2Vw3UAVNJSfz5_zhRcHUWEBDrpB7pyQ85Yroep0AKbw%40mail.gmail.com > https://www.postgresql.org/message-id/flat/CA%2BhUKGK4StQ%3DeXGZ-5hTdYCmSfJ37yzLp9yW9U5uH6526H%2BUeg%40mail.gmail.com

I'm not much of an expert here and I didn't follow the recovery prefetching stuff closely. So, with that in mind, I think there is a lot that could be done along the lines of what Thomas is mentioning: batching WAL records up by filenode, then replaying each filenode one by one when our batching buffer is full. There could be some sort of parallel options there too, where workers replay a filenode each. However, that wouldn't really work for recovery on a hot standby. We'd need to ensure we replay the commit record for each transaction last. I think you'd have to batch by filenode and transaction in that case. Each batch might be pretty small on a typical OLTP workload, so it might not help much there, or it might hinder.

But having said that, I don't think any of those possibilities should stop us speeding up smgropen().

> Another potential option that we've discussed is that the redo generation > itself is likely a brake of efficient recovery performance today (e.g. > INSERT-SELECT on table with indexes, generates interleaved WAL records that > touch often limited set of blocks that usually put Smgr into spotlight).

I'm not quite sure if I understand what you mean here. Is this queuing up WAL records during transactions and flushing them out to WAL every so often after rearranging them into an order that's more optimal for replay?

David
Re: Performance degradation of REFRESH MATERIALIZED VIEW
On 4/27/21 8:22 PM, Andres Freund wrote:

Hi,

On 2021-04-28 00:44:47 +0900, Masahiko Sawada wrote: On Wed, Apr 28, 2021 at 12:26 AM Masahiko Sawada wrote:

What Andres is suggesting (I think) is to modify ExecInsert() to pass a valid bistate to table_tuple_insert, instead of just NULL, and store the vmbuffer in it.

Understood. This approach keeps using the same vmbuffer until we need another vm page corresponding to the target heap page, which seems better. But how is ExecInsert() related to REFRESH MATERIALIZED VIEW?

I was thinking of the CONCURRENTLY path for REFRESH MATERIALIZED VIEW I think. Or something. That actually makes it easier - we already pass in a bistate in the relevant paths. So if we add a current_vmbuf to BulkInsertStateData, we can avoid needing to pin so often. It seems that'd end up with a good bit cleaner and less risky code than the skip_vmbuffer_for_frozen_tuple_insertion_v3.patch approach. The current RelationGetBufferForTuple() interface / how it's used in heapam.c doesn't make this quite as trivial as it could be... Attached is a quick hack implementing this. For me it reduces the overhead noticeably:

REFRESH MATERIALIZED VIEW mv;
before: Time: 26542.333 ms (00:26.542)
after: Time: 23105.047 ms (00:23.105)

Thanks, that looks promising. I repeated the tests I did on 26/4, and the results look like this:

old (0c7d3bb99): 497ms
master: 621ms
patched: 531ms

So yeah, that's a nice improvement - it does not remove the regression entirely, but +5% is much better than +25%.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: RFC: Detailed reorder buffer stats dumps
On Wed, May 5, 2021 at 4:03 PM Craig Ringer wrote: > > Hi all > > I'm thinking of piggy-backing on the approach used in the "Get memory > contexts of an arbitrary backend process" patch in order to provide access to > detailed reorder buffer content statistics from walsenders on request. > > Right now the reorder buffer is mostly a black-box. I mostly rely on gdb or > on dynamic probes (perf, systemtap) to see what it's doing. I intend a patch > soon to add a couple of fields to struct WalSnd to report some very coarse > reorder buffer stats - at least oldest buffered xid, number of buffered txns, > total bytes of buffered txns in memory, total bytes of buffered txns spilled > to disk. > > But sometimes what I really want is details on the txns that're in the > reorder buffer, and that's not feasible to export via always-enabled > reporting like struct WalSnd. So I'm thinking that the same approach used for > the memory context stats patch might work well for asking the walsender for a > detailed dump of reorder buffer contents. Something like a per-buffered-txn > table of txn topxid, start-lsn, most recent change lsn, number of changes, > number of subxids, number of invalidations, number of catalog changes, buffer > size in memory, buffer size spilled to disk. > +1. Will be really useful to troubleshoot what's going on in the ReorderBuffer. If we put that in WalSnd it will not be useful for the connections which are using normal backends to get logical changes through built-in functions. -- Best Wishes, Ashutosh Bapat
Re: Small issues with CREATE TABLE COMPRESSION
On Wed, May 5, 2021 at 7:09 AM Dilip Kumar wrote: > So basically, if we have to write this test case in pg_dump then we > will have to use lz4 which means it will generate different output > --with-lz4 vs --without-lz4. With a simple regress test it easy to > deal with such cases by keeping multiple .out files but I am not sure > can we do this easily with pg_dump test without adding much > complexity? TAP tests have a facility for conditionally skipping tests; see perldoc Test::More. That's actually superior to what you can do with pg_regress. We'd need to come up with some logic to determine when to skip or not, though. Perhaps the easiest method would be to have the relevant Perl script try to create a table with an lz4 column. If that works, then perform the LZ4-based tests. If it fails, check the error message. If it says anything that LZ4 is not supported by this build, skip those tests. If it says anything else, die. -- Robert Haas EDB: http://www.enterprisedb.com
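A rough sketch of the probe being described; the Perl TAP script would wrap a statement like this in its skip logic, and the exact error wording is an assumption to be checked against the real message:

    -- Succeeds only in builds configured --with-lz4; in other builds it fails
    -- with an error saying lz4 compression is not supported, which the TAP
    -- script can use to decide whether to skip the LZ4-based tests.
    CREATE TABLE lz4_probe (t text COMPRESSION lz4);
    DROP TABLE lz4_probe;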
RE: Use simplehash.h instead of dynahash in SMgr
Hey David,

> I think you'd have to batch by filenode and transaction in that case. Each
> batch might be pretty small on a typical OLTP workload, so it might not help
> much there, or it might hinder.

True, it is very workload dependent (I was chasing mainly multi-value INSERTs and INSERT-SELECT, which often hit the same $block), certainly not OLTP. I would even say that INSERT-as-SELECT would be more suited for DWH-like processing.

> But having said that, I don't think any of those possibilities should stop us
> speeding up smgropen().

Of course! I've tried a couple of much smaller ideas, but without any big gains. I was able to squeeze like 300-400k function calls per second (WAL records/s); that was the point where I think smgropen() got abused.

> > Another potential option that we've discussed is that the redo generation
> > itself is likely a brake of efficient recovery performance today (e.g. INSERT-
> > SELECT on table with indexes, generates interleaved WAL records that touch
> > often limited set of blocks that usually put Smgr into spotlight).
>
> I'm not quite sure if I understand what you mean here. Is this queuing up
> WAL records up during transactions and flush them out to WAL every so
> often after rearranging them into an order that's more optimal for replay?

Why not both? 😉 We were very concentrated on the standby side, but on the primary side one could also change how WAL records are generated:

1) Minimization of records touching the same repeated $block, e.g. the Heap2 table_multi_insert() API already does this, and it matters for generating a more optimal stream for replay:

postgres@test=# create table t (id bigint primary key);
postgres@test=# insert into t select generate_series(1, 10);

results in many calls due to interleaving heap with btree records for the same block from the Smgr perspective (this is especially visible on highly indexed tables) =>

rmgr: Btree len (rec/tot): 64/64, tx: 17243284, lsn: 4/E7000108, prev 4/E7A0, desc: INSERT_LEAF off 1, blkref #0: rel 1663/16384/32794 blk 1
rmgr: Heap  len (rec/tot): 63/63, tx: 17243284, lsn: 4/E7000148, prev 4/E7000108, desc: INSERT off 2 flags 0x00, blkref #0: rel 1663/16384/32791 blk 0
rmgr: Btree len (rec/tot): 64/64, tx: 17243284, lsn: 4/E7000188, prev 4/E7000148, desc: INSERT_LEAF off 2, blkref #0: rel 1663/16384/32794 blk 1
rmgr: Heap  len (rec/tot): 63/63, tx: 17243284, lsn: 4/E70001C8, prev 4/E7000188, desc: INSERT off 3 flags 0x00, blkref #0: rel 1663/16384/32791 blk 0
rmgr: Btree len (rec/tot): 64/64, tx: 17243284, lsn: 4/E7000208, prev 4/E70001C8, desc: INSERT_LEAF off 3, blkref #0: rel 1663/16384/32794 blk 1
rmgr: Heap  len (rec/tot): 63/63, tx: 17243284, lsn: 4/E7000248, prev 4/E7000208, desc: INSERT off 4 flags 0x00, blkref #0: rel 1663/16384/32791 blk 0
rmgr: Btree len (rec/tot): 64/64, tx: 17243284, lsn: 4/E7000288, prev 4/E7000248, desc: INSERT_LEAF off 4, blkref #0: rel 1663/16384/32794 blk 1
rmgr: Heap  len (rec/tot): 63/63, tx: 17243284, lsn: 4/E70002C8, prev 4/E7000288, desc: INSERT off 5 flags 0x00, blkref #0: rel 1663/16384/32791 blk 0
[..]

Similar stuff happens for UPDATE. It basically prevents the recent-buffer optimization that avoids repeated calls to smgropen().
And here's a sample of the already existing table_multi_insert v2 API (Heap2), with the obvious elimination of unnecessary individual calls to smgropen() via one big MULTI_INSERT instead (for CTAS/COPY/REFRESH MV):

postgres@test=# create table t (id bigint primary key);
postgres@test=# copy (select generate_series (1, 10)) to '/tmp/t';
postgres@test=# copy t from '/tmp/t';

=>

rmgr: Heap2 len (rec/tot): 210/210, tx: 17243290, lsn: 4/E928, prev 4/E8004410, desc: MULTI_INSERT+INIT 10 tuples flags 0x02, blkref #0: rel 1663/16384/32801 blk 0
rmgr: Btree len (rec/tot): 102/102, tx: 17243290, lsn: 4/E9000100, prev 4/E928, desc: NEWROOT lev 0, blkref #0: rel 1663/16384/32804 blk 1, blkref #2: rel 1663/16384/32804 blk 0
rmgr: Btree len (rec/tot): 64/64, tx: 17243290, lsn: 4/E9000168, prev 4/E9000100, desc: INSERT_LEAF off 1, blkref #0: rel 1663/16384/32804 blk 1
rmgr: Btree len (rec/tot): 64/64, tx: 17243290, lsn: 4/E90001A8, prev 4/E9000168, desc: INSERT_LEAF off 2, blkref #0: rel 1663/16384/32804 blk 1
rmgr: Btree len (rec/tot): 64/64, tx: 17243290, lsn: 4/E90001E8, prev 4/E90001A8, desc: INSERT_LEAF off 3, blkref #0: rel 1663/16384/32804 blk 1
[..]

Here the Btree traffic is very localized (at least when concurrent sessions are not generating WAL), and it enables Thomas's recent-buffer optimization to kick in. DELETE is much simpler (thanks to not chewing out those Btree records) and, also thanks to Thomas's recent-buffer, should theoretically put much less stress on smgropen() already:

rmgr: Heap  len (rec/tot): 54/54, tx: 17243
Re: [bug?] Missed parallel safety checks, and wrong parallel safety
On Tue, May 4, 2021 at 11:47 PM Greg Nancarrow wrote: > Problem is, for built-in functions, the changes are allowed, but for > some properties (like strict) the allowed changes don't actually take > effect (this is what Amit was referring to - so why allow those > changes?). > It's because some of the function properties are cached in > FmgrBuiltins[] (for a "fast-path" lookup for built-ins), according to > their state at build time (from pg_proc.dat), but ALTER FUNCTION is > just changing it in the system catalogs. Also, with sufficient > privileges, a built-in function can be redefined, yet the original > function (whose info is cached in FmgrBuiltins[]) is always invoked, > not the newly-defined version. I agree. I think that's not ideal. I think we should consider putting some more restrictions on updating system catalog changes, and I also think that if we can get out of having strict need to be part of FmgrBuiltins[] that would be good. But what I don't agree with is the idea that since strict already has this problem, it's OK to do the same thing with parallel-safety. That seems to me to be making a bad situation worse, and I can't see what problem it actually solves. -- Robert Haas EDB: http://www.enterprisedb.com
Re: MaxOffsetNumber for Table AMs
On Tue, May 4, 2021 at 9:24 PM Peter Geoghegan wrote: > Here is my concern: I have an obligation to make it clear that I think > that you really ought to straighten out this business with > generalizing TIDs before too long. Not because I say so, but because > it's holding up progress in general. If you aren't getting cooperation > from people who work on indexing (could be somebody else), then > consider the possibility that this business with TIDs and bitmap scans > has a lot to do with it. Most people are not as outspoken as I am. It seems to me that we're doing a lot of disagreeing given that, as I see it, there are only relatively minor differences between the positions of the various people here. Andres and I are, I think, relatively convinced that variable-width TIDs would let us do things that aren't otherwise possible, and that those things are potentially useful and I would even venture to say cool. I don't believe you disagree with that, but you think it's going to be too much work to implement. Fair enough; anyone can try it who is interested and see how far they get. Anyone who thinks it's going to be impossibly hard probably will prefer not to try, and that's OK too. But if we take that off the table, what about a less-ambitious generalization of the TID mechanism? I can't really see anyone putting up a serious argument against allowing all 48 bits of space available in the existing TID format to be used which, as Jeff points out, is not currently the case. So anyone who wants to try to write that patch is free to do so. I don't have a clear idea how to make that work, to be honest, but my limited supply of ideas need not prevent anyone else from trying theirs. There might be some slight disagreement about whether it's useful to generalize TIDs from a 48-bit address space to a 64-bit address space without making it fully general. Like Andres, I am unconvinced that's meaningfully easier, and I am convinced that it's meaningfully less good, but other people can disagree and that's fine. I'm perfectly willing to change my opinion if somebody shows up with a patch that demonstrates the value of this approach. The main point here is one that I think you made a few emails back: the limitations of the current system are defined by what will actually work with the code as it exists today, not some mailing list discussion. It's too early for the project to commit to stability in this area; we have not managed to get a single AM apart from heapam into core, and that situation doesn't appear likely to change in the near future. If and when we have say 5 of those we can probably articulate some intelligent ideas about what we think the patterns that need to hold for future AMs are, but it's reckless to extrapolate from 1 working example, and right now that's all we have. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Is txid_status() actually safe? / What is 011_crash_recovery.pl testing?
On Wed, 10 Feb 2021 at 04:28, Robert Haas wrote: > On Mon, Feb 8, 2021 at 4:52 PM Andres Freund wrote: > > The 011_crash_recovery.pl test test starts a transaction, creates a > > table, fetches the transaction's xid. Then shuts down the server in > > immediate mode. It then asserts that after crash recovery the previously > > assigned xid is shown as aborted, and that new xids are assigned after > > the xid. > > > > But as far as I can tell there's no guarantee that that is the case. > > I think you are right. > > Andres, I missed this mail initially. I'll look into it shortly. -- Craig Ringer http://www.2ndQuadrant.com/ 2ndQuadrant - PostgreSQL Solutions for the Enterprise
pg_receivewal makes a bad daemon
You might want to use pg_receivewal to save all of your WAL segments somewhere instead of relying on archive_command. It has, at the least, the advantage of working on the byte level rather than the segment level. But it seems to me that it is not entirely suitable as a substitute for archiving, for a couple of reasons. One is that as soon as it runs into a problem, it exits, which is not really what you want out of a daemon that's critical to the future availability of your system. Another is that you can't monitor it aside from looking at what it prints out, which is also not really what you want for a piece of critical infrastructure.

The first problem seems somewhat more straightforward. Suppose we add a new command-line option, perhaps --daemon but we can bikeshed. If this option is specified, then it tries to keep going when it hits a problem, rather than just giving up. There's some fuzziness in my mind about exactly what this should mean. If the problem we hit is that we lost the connection to the remote server, then we should try to reconnect. But if the problem is something like a failure inside open_walfile() or close_walfile(), like a failed open() or fsync() or close() or something, it's a little less clear what to do. Maybe one idea would be to have a parent process and a child process, where the child process does all the work and the parent process just keeps re-launching it if it dies. It's not entirely clear that this is a suitable way of recovering from, say, an fsync() failure, given previous discussions claiming that - and I might be exaggerating a bit here - there is essentially no way to recover from a failed fsync() because the kernel might have already thrown out your data and you might as well just set the data center on fire - but perhaps a retry system that can't cope with certain corner cases is better than not having one at all, and perhaps we could revise the logic here and there to have the process doing the work take some action other than exiting when that's an intelligent approach.

The second problem is a bit more complex. If you were transferring WAL to another PostgreSQL instance rather than to a frontend process, you could log to some place other than standard output, like for example a file, and you could periodically rotate that file, or alternatively you could log to syslog or the Windows event log. Even better, you could connect to PostgreSQL and run SQL queries against monitoring views and see what results you get. If the existing monitoring views don't give users what they need, we can improve them, but the whole infrastructure needed for this kind of thing is altogether lacking for any frontend program. It does not seem very appealing to reinvent log rotation, connection management, and monitoring views inside pg_receivewal, let alone in every frontend process where similar monitoring might be useful. But at least for me, without such capabilities, it is a little hard to take pg_receivewal seriously.

I wonder first of all whether other people agree with these concerns, and secondly what they think we ought to do about it. One option is - do nothing. This could be based either on the idea that pg_receivewal is hopeless, or else on the idea that pg_receivewal can be restarted by some external system when required and monitored well enough as things stand.
A second option is to start building out capabilities in pg_receivewal to turn it into something closer to what you'd expect of a normal daemon, with the addition of a retry capability as probably the easiest improvement.

A third option is to somehow move towards a world where you can use the server to move WAL around even if you don't really want to run the server. Imagine a server running with no data directory and only a minimal set of running processes, just (1) a postmaster and (2) a walreceiver that writes to an archive directory and (3) non-database-connected backends that are just smart enough to handle queries for status information. This has the same problem that I mentioned on the thread about monitoring the recovery process, namely that we haven't got pg_authid. But against that, you get a lot of infrastructure for free: configuration files, process management, connection management, an existing wire protocol, memory contexts, rich error reporting, etc.

I am curious to hear what other people think about the usefulness (or lack thereof) of pg_receivewal as things stand today, as well as ideas about future direction.

Thanks,

--
Robert Haas
EDB: http://www.enterprisedb.com
Proposal: per expression intervalstyle
Problem: I have to set IntervalStyle in a separate statement to convert an interval to an ISO 8601 string. This isn't well supported by ORMs or similar tools, requiring us to set it globally (per role or per database).

Problem #2 (exotic): you can't mix two output styles in a single query.

Proposed solution: special-case the to_char function to accept the values accepted by SET intervalstyle TO 'XYZ'. So:

SELECT to_char(INTERVAL '8 minutes', 'iso_8601')

will act similarly to:

SET intervalstyle TO 'iso_8601';
SELECT (INTERVAL '8 minutes')::text;
RESET intervalstyle;
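For comparison, the transaction-scoped workaround available today looks like this (a minimal sketch; the PT8M output shown is what the iso_8601 style is expected to produce for this value):

    BEGIN;
    SET LOCAL intervalstyle = 'iso_8601';   -- scoped to this transaction only
    SELECT (INTERVAL '8 minutes')::text;    -- PT8M
    COMMIT;

SET LOCAL at least avoids leaking the setting beyond the transaction, but it still requires extra statements around every query, which is exactly what's awkward to arrange through an ORM.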
Re: Is txid_status() actually safe? / What is 011_crash_recovery.pl testing?
On Tue, 9 Feb 2021 at 05:52, Andres Freund wrote:
>
> Craig, it kind of looks to me like you assumed it'd be guaranteed that
> the xid at this point would show in-progress?
>

At the time I wrote that code, I don't think I understood that xid assignment wasn't necessarily durable until either (a) the next checkpoint; or (b) commit of some txn with a greater xid.

IIRC I expected that after crash and recovery the tx would always be treated as aborted, because the xid had been assigned but no corresponding commit was found before end-of-recovery. No explicit abort records are written to WAL for such txns since we crashed, but the server's oldest in-progress txn threshold is used to determine that they must be aborted rather than in-progress even though their clog entries aren't set to aborted.

Which was fine as far as it went, but I failed to account for the xid assignment not necessarily being durable when the client calls txid_status().

> I don't think the use of txid_status() described in the docs added in
> the commit is actually ever safe?
>

I agree.

The client can query for its xid with txid_current() but as you note there's no guarantee that the assigned xid is durable. The client would have to ensure that an xid was assigned, then ensure that the WAL was durably flushed past the point of the xid assignment before relying on the xid.

If we do a txn that performs a small write, calls txid_current(), and sends a commit that the server crashes before completing, we can't know for sure that the xid we recorded client-side before the server crash is the same txn we check the status of after crash recovery. Some other txn could've re-used the xid after crash so long as no other txn with a greater xid durably committed before the crash. That scenario isn't hugely likely, but it's definitely possible on systems that don't do a lot of concurrent txns or do mostly long, heavyweight txns.

The txid_status() function was originally intended to be paired with a way to report topxid assignment to the client automatically, NOTIFY or GUC_REPORT-style. But that would not make this usage safe either, unless we delayed the report until WAL was flushed past the LSN of the xid assignment *or* some other txn with a greater xid committed.

This could be made safe with a variant of txid_current() that forced the xid assignment to be logged immediately if it was not already, and did not return until WAL flushed past the point of the assignment. If the client did most of the txn's work before requesting a guaranteed-durable xid, it would in practice not land up having to wait for a flush. But we'd have to keep track of when we assigned the xid in every single topxact in order to be able to promise we'd flushed it without having to immediately force a flush. That's pointless overhead all the rest of the time, just in case someone wants to get an xid for later use with txid_status().

The simplest option with no overhead on anything that doesn't care about txid_status() is to expose a function to force flush of WAL up to the current insert LSN. Then update the docs to say you have to call it after txid_current(), and before sending your commit. But at that point you might as well use 2PC, since you're paying the same double flush and double round-trip costs. The main point of txid_status() was to avoid the cost of that double-flush.

--
Craig Ringer
http://www.2ndQuadrant.com/
2ndQuadrant - PostgreSQL Solutions for the Enterprise
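For concreteness, the documented pattern under discussion looks roughly like this (the xid value is illustrative); the gap described above is that nothing guarantees the xid assignment has reached durable WAL before the crash:

    BEGIN;
    INSERT INTO my_table VALUES (1);     -- some small write (illustrative)
    SELECT txid_current();               -- client records the xid, e.g. 1234
    COMMIT;                              -- suppose the server crashes around here

    -- Later, on a new connection after crash recovery:
    SELECT txid_status(1234);            -- 'committed', 'aborted', 'in progress', or NULL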
Re: Is txid_status() actually safe? / What is 011_crash_recovery.pl testing?
On Wed, 5 May 2021 at 23:15, Craig Ringer wrote: > Which was fine as far as it went, but I failed to account for the xid > assignment not necessarily being durable when the client calls > txid_status(). Ahem, I meant "when the client calls txid_current()" -- Craig Ringer http://www.2ndQuadrant.com/ 2ndQuadrant - PostgreSQL Solutions for the Enterprise
COPY table_name (single_column) FROM 'unknown.txt' DELIMITER E'\n'
Hi,

From time to time, I need to deal with bizarrely delimited text files, having to use tools such as awk/sed/perl to reformat the files so that they can be copied into PostgreSQL. If such files could be imported to a table with a single text column, we could then use PostgreSQL's now really fast regex engine to do data cleaning/reformatting, instead of having to rely on external awk-ward tools.

Furthermore, sometimes you don't want to clean/reformat the data at all, but simply import the text lines "as is" without modifications, such as when wanting to import unformatted log files, where the log lines can contain any characters.

Could it be an idea to exploit the fact that DELIMITER E'\n' is currently an error?

ERROR: COPY delimiter cannot be newline or carriage return

That is, to change E'\n' to be a valid delimiter, which would simply read each line delimited by newlines, as a single column.

The hack I'm currently abusing is to find some one-byte character that is not present anywhere in the text file, and then to use that character as a delimiter. This doesn't work when needing to deal with a text file whose content is unknown at the time of writing the code though, so it's mostly useful for throwaway one-off queries.

Thoughts?

/Joel

MySQL seems to already support using \n as a delimiter (I haven't verified it myself though) [1]

[1] https://stackoverflow.com/questions/18394620/postgres-import-file-that-has-columns-separated-by-new-lines
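To spell out the hack and the proposal side by side, here's a sketch; it assumes the chosen byte (0x01 here) never appears in the file, and the final, commented-out statement is the proposed-but-not-yet-supported form:

    CREATE TABLE raw_lines (line text);

    -- Current workaround: pick a one-byte character known not to occur in the file.
    COPY raw_lines (line) FROM '/tmp/unknown.txt' DELIMITER E'\x01';

    -- Proposed: accept newline as the delimiter, loading one whole line per row.
    -- COPY raw_lines (line) FROM '/tmp/unknown.txt' DELIMITER E'\n';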
Re: pg_receivewal makes a bad daemon
On Wed, 2021-05-05 at 11:04 -0400, Robert Haas wrote: > You might want to use pg_receivewal to save all of your WAL segments > somewhere instead of relying on archive_command. It has, at the least, > the advantage of working on the byte level rather than the segment > level. But it seems to me that it is not entirely suitable as a > substitute for archiving, for a couple of reasons. One is that as soon > as it runs into a problem, it exits, which is not really what you want > out of a daemon that's critical to the future availability of your > system. Another is that you can't monitor it aside from looking at > what it prints out, which is also not really what you want for a piece > of critical infrastructure. > > The first problem seems somewhat more straightforward. Suppose we add > a new command-line option, perhaps --daemon but we can bikeshed. If > this option is specified, then it tries to keep going when it hits a > problem, rather than just giving up. [...] That sounds like a good idea. I don't know what it takes to make that perfect (if such a thing exists), but simply trying to re-establish database connections and dying when we hit an I/O problem seems like a clear improvement. > The second problem is a bit more complex. [...] If I wanted to monitor pg_receivewal, I'd have it use a replication slot and monitor "pg_replication_slots" on the primary. That way I see if there is a WAL sender process, and I can measure the lag in bytes. What more could you want? Yours, Laurenz Albe
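As a sketch of that kind of check, assuming pg_receivewal was started with a replication slot named 'pg_receivewal' (the slot name is an assumption), a query on the primary could look like:

    SELECT slot_name,
           active,                 -- is pg_receivewal connected right now?
           pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS lag_bytes
    FROM pg_replication_slots
    WHERE slot_name = 'pg_receivewal';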
Re: MaxOffsetNumber for Table AMs
On Wed, May 5, 2021 at 7:27 AM Robert Haas wrote: > It seems to me that we're doing a lot of disagreeing given that, as I > see it, there are only relatively minor differences between the > positions of the various people here. I'm being very vocal here because I'm concerned that we're going about generalizing TIDs in the wrong way. To me it feels like there is a loss of perspective about what really matters. There just isn't that many table AM TID designs that could ever work, and even among those schemes that could ever work there is a pretty clear hierarchy. This blue sky thinking about generalizing TIDs 2 years in seems *weird* to me. Nobody is obligated to reassure me. I felt that this had to be said. > Andres and I are, I think, > relatively convinced that variable-width TIDs would let us do things > that aren't otherwise possible, and that those things are potentially > useful and I would even venture to say cool. I don't believe you > disagree with that, but you think it's going to be too much work to > implement. Fair enough; anyone can try it who is interested and see > how far they get. Anyone who thinks it's going to be impossibly hard > probably will prefer not to try, and that's OK too. I think that's accurate. But it's easy to not disagree with the idea that variable-width TIDs might lead to something interesting. Talk is cheap. No other database system has something like indirect indexes. They have clustered indexes, but that's rather different. I think that indirect indexes were a design that was concerned about the issue of write amplification from non-HOT updates. But do I even remember the details correctly? We're talking about indirect indexes as if that was an idea whose high level user-visible goals were clear, but I don't even think that that much is true. This kind of thing concerns me. It very much feels like failing to see the forest for the trees. > But if we take that off the table, what about a less-ambitious > generalization of the TID mechanism? I can't really see anyone putting > up a serious argument against allowing all 48 bits of space available > in the existing TID format to be used which, as Jeff points out, is > not currently the case. So anyone who wants to try to write that patch > is free to do so. I don't have a clear idea how to make that work, to > be honest, but my limited supply of ideas need not prevent anyone else > from trying theirs. I agree that we should focus on what we can agree on. It seems as if we all more or less agree on this much. > There might be some slight disagreement about whether it's useful to > generalize TIDs from a 48-bit address space to a 64-bit address space > without making it fully general. Like Andres, I am unconvinced that's > meaningfully easier, and I am convinced that it's meaningfully less > good, but other people can disagree and that's fine. I'm perfectly > willing to change my opinion if somebody shows up with a patch that > demonstrates the value of this approach. It's going to be hard if not impossible to provide empirical evidence for the proposition that 64-bit wide TIDs (alongside 48-bit TIDs) are the way to go. Same with any other scheme. We're talking way too much about TIDs themselves and way too little about table AM use cases, the way the data structures might work in new table AMs, and so on. 
> The main point here is one that I think you made a few emails back: > the limitations of the current system are defined by what will > actually work with the code as it exists today, not some mailing list > discussion. Right. > It's too early for the project to commit to stability in > this area; we have not managed to get a single AM apart from heapam > into core, and that situation doesn't appear likely to change in the > near future. I would be happy if we could commit to committing to stability. I really don't think that it's *that* hard to move significantly closer to a design that describes just how close to heapam a table AM should be. It doesn't commit the table AM to all that many details. > If and when we have say 5 of those we can probably > articulate some intelligent ideas about what we think the patterns > that need to hold for future AMs are, but it's reckless to extrapolate > from 1 working example, and right now that's all we have. I don't think that there will be 5 table AMs that are credible to users at any point in the future. In any case there only needs to be 1 or 2 good ones for the table AM to have been a resounding success. -- Peter Geoghegan
Re: Printing backtrace of postgres processes
On Wed, Feb 3, 2021 at 3:24 PM Bharath Rupireddy wrote: > > On Wed, Feb 3, 2021 at 1:49 PM vignesh C wrote: > > > > On Wed, Feb 3, 2021 at 1:00 PM Tom Lane wrote: > > > > > > vignesh C writes: > > > > On Mon, Feb 1, 2021 at 11:04 AM Bharath Rupireddy > > > > wrote: > > > >> Are these superuser and permission checks enough from a security > > > >> standpoint that we don't expose some sensitive information to the > > > >> user? > > > > > > > This will just print the backtrace of the current backend. Users > > > > cannot get password information from this. > > > > > > Really? > > > > > > A backtrace normally exposes the text of the current query, for > > > instance, which could contain very sensitive data (passwords in ALTER > > > USER, customer credit card numbers in ordinary data, etc etc). We > > > don't allow the postmaster log to be seen by any but very privileged > > > users; it's not sane to think that this data is any less > > > security-critical than the postmaster log. > > > > > > This point is entirely separate from the question of whether > > > triggering stack traces at inopportune moments could cause system > > > malfunctions, but that question is also not to be ignored. > > > > > > TBH, I'm leaning to the position that this should be superuser > > > only. I do NOT agree with the idea that ordinary users should > > > be able to trigger it, even against backends theoretically > > > belonging to their own userid. (Do I need to point out that > > > some levels of the call stack might be from security-definer > > > functions with more privilege than the session's nominal user?) > > > > > > > I had seen that the log that will be logged will be something like: > > postgres: test postgres [local] > > idle(ProcessClientReadInterrupt+0x3a) [0x9500ec] > > postgres: test postgres [local] idle(secure_read+0x183) [0x787f43] > > postgres: test postgres [local] idle() [0x7919de] > > postgres: test postgres [local] idle(pq_getbyte+0x32) [0x791a8e] > > postgres: test postgres [local] idle() [0x94fc16] > > postgres: test postgres [local] idle() [0x950099] > > postgres: test postgres [local] idle(PostgresMain+0x6d5) [0x954bd5] > > postgres: test postgres [local] idle() [0x898a09] > > postgres: test postgres [local] idle() [0x89838f] > > postgres: test postgres [local] idle() [0x894953] > > postgres: test postgres [local] idle(PostmasterMain+0x116b) > > [0x89422a] > > postgres: test postgres [local] idle() [0x79725b] > > /lib64/libc.so.6(__libc_start_main+0xf5) [0x7f6e68d7] > > postgres: test postgres [local] idle() [0x484249] > > I was not sure if we would be able to get any secure information from > > this. I did not notice the function arguments being printed. I felt > > the function name, offset and the return address will be logged. I > > might be missing something here. > > Thoughts? > > First of all, we need to see if the output of pg_print_backtrace shows > up function parameter addresses or only function start addresses along > with line and file information when attached to gdb. In either case, > IMO, it will be easy for experienced hackers(I'm not one though) to > calculate and fetch the query string or other function parameters or > the variables inside the functions from the stack by looking at the > code (which is available openly, of course). > > Say, if a backend is in a long running scan or insert operation, then > pg_print_backtrace is issued from another session, the > exec_simple_query function shows up query_string. 
Below is captured > from attached gdb though, I'm not sure whether the logged backtrace > will have function address or the function parameters addresses, I > think we can check that by having a long running query which > frequently checks interrupts and issue pg_print_backtrace from another > session to that backend. Now, attach gdb to the backend in which the > query is running, then take bt, see if the logged backtrace and the > gdb bt have the same or closer addresses. > > #13 0x5644f4320729 in exec_simple_query ( > query_string=0x5644f6771bf0 "select pg_backend_pid();") at postgres.c:1240 > #14 0x5644f4324ff4 in PostgresMain (argc=1, argv=0x7ffd819bd5e0, > dbname=0x5644f679d2b8 "postgres", username=0x5644f679d298 "bharath") > at postgres.c:4394 > #15 0x5644f4256f9d in BackendRun (port=0x5644f67935c0) at > postmaster.c:4484 > #16 0x5644f4256856 in BackendStartup (port=0x5644f67935c0) at > postmaster.c:4206 > #17 0x5644f4252a11 in ServerLoop () at postmaster.c:1730 > #18 0x5644f42521aa in PostmasterMain (argc=3, argv=0x5644f676b1f0) > at postmaster.c:1402 > #19 0x5644f4148789 in main (argc=3, argv=0x5644f676b1f0) at main.c:209 > > As suggested by Tom, I'm okay if this function is callable only by the > superusers. In that case, the superusers can fetch the backtrace and > send it for further analysis in case of
Re: useless argument of ATAddForeignKeyConstraint
On 2019-Feb-14, Amit Langote wrote: > While reviewing the foreign keys referencing partitioned tables patch, I > noticed that the parentConstr argument of ATAddForeignConstraint is > rendered useless by the following commit: > Maybe remove that argument in HEAD ? Attached a patch.

Indeed -- two years later this is still valid, so applied, with thanks!

--
Álvaro Herrera   39°49'30"S 73°17'W
"Investigación es lo que hago cuando no sé lo que estoy haciendo" (Wernher von Braun)
Re: pg_receivewal makes a bad daemon
On Wed, May 5, 2021 at 5:04 PM Robert Haas wrote: > > You might want to use pg_receivewal to save all of your WAL segments > somewhere instead of relying on archive_command. It has, at the least, > the advantage of working on the byte level rather than the segment > level. But it seems to me that it is not entirely suitable as a > substitute for archiving, for a couple of reasons. One is that as soon > as it runs into a problem, it exits, which is not really what you want > out of a daemon that's critical to the future availability of your > system. Another is that you can't monitor it aside from looking at > what it prints out, which is also not really what you want for a piece > of critical infrastructure. > > The first problem seems somewhat more straightforward. Suppose we add > a new command-line option, perhaps --daemon but we can bikeshed. If > this option is specified, then it tries to keep going when it hits a > problem, rather than just giving up. There's some fuzziness in my mind > about exactly what this should mean. If the problem we hit is that we > lost the connection to the remote server, then we should try to > reconnect. But if the problem is something like a failure inside > open_walfile() or close_walfile(), like a failed open() or fsync() or > close() or something, it's a little less clear what to do. Maybe one > idea would be to have a parent process and a child process, where the > child process does all the work and the parent process just keeps > re-launching it if it dies. It's not entirely clear that this is a > suitable way of recovering from, say, an fsync() failure, given > previous discussions claiming that - and I might be exaggerating a bit > here - there is essentially no way to recover from a failed fsync() > because the kernel might have already thrown out your data and you > might as well just set the data center on fire - but perhaps an retry > system that can't cope with certain corner cases is better than not > having one at all, and perhaps we could revise the logic here and > there to have the process doing the work take some action other than > exiting when that's an intelligent approach.

Is this really a problem we should fix ourselves? Most daemon managers today can happily be configured to automatically restart a daemon on failure with a single setting, and have been able to for a long time now. E.g. in systemd (which most linuxen use now) you just set Restart=on-failure (or maybe even Restart=always) and something like RestartSec=10.

That said, it wouldn't cover an fsync() error -- they will always restart. The way to handle that is for the operator to capture the error message perhaps, and just "deal with it"?

What could be more interesting there in a "systemd world" would be to add watchdog support. That'd obviously only be interesting on systemd platforms, but we already have some of that basic notification support in the postmaster for those.

> The second problem is a bit more complex. If you were transferring WAL > to another PostgreSQL instance rather than to a frontend process, you > could log to some place other than standard output, like for example a > file, and you could periodically rotate that file, or alternatively > you could log to syslog or the Windows event log. Even better, you > could connect to PostgreSQL and run SQL queries against monitoring > views and see what results you get. 
If the existing monitoring views > don't give users what they need, we can improve them, but the whole > infrastructure needed for this kind of thing is altogether lacking for > any frontend program. It does not seem very appealing to reinvent log > rotation, connection management, and monitoring views inside > pg_receivewal, let alone in every frontend process where similar > monitoring might be useful. But at least for me, without such > capabilities, it is a little hard to take pg_receivewal seriously. Again, isn't this the job of the daemon runner? At least in cases where it's not Windows :)? That is, taking the output and putting it in a log, and interfacing with log rotation. Now, having some sort of statistics *other* than parsing a log would definitely be useful. But perhaps that could be something as simple as having a --statsfile=/foo/bar parameter and then updating that file at regular intervals with "whatever is the current state"? And of course, the other point to monitor is the replication slot on the server it's connected to -- but I agree that being able to monitor both sides there would be good. > I wonder first of all whether other people agree with these concerns, > and secondly what they think we ought to do about it. One option is - > do nothing. This could be based either on the idea that pg_receivewal > is hopeless, or else on the idea that pg_receivewal can be restarted > by some external system when required and monitored well enough as > things stand. A second option is to start building out capabilities in > pg_receivewal to turn it into something clo
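A minimal sketch of the slot-side monitoring mentioned above, assuming pg_receivewal connects over a replication slot (the slot name here is purely illustrative); pg_replication_slots and pg_wal_lsn_diff() are existing server facilities:

    -- On the sending server: is the pg_receivewal connection active,
    -- and how much WAL is the slot holding back?
    SELECT slot_name,
           active,
           restart_lsn,
           pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained_bytes
    FROM pg_replication_slots
    WHERE slot_name = 'pg_receivewal_slot';  -- illustrative slot name

This only covers the sending side; the state inside pg_receivewal itself is exactly what the --statsfile idea above would have to expose.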
Re: MaxOffsetNumber for Table AMs
On Wed, May 5, 2021 at 11:50 AM Peter Geoghegan wrote: > I'm being very vocal here because I'm concerned that we're going about > generalizing TIDs in the wrong way. To me it feels like there is a > loss of perspective about what really matters. Well, which things matter is a question of opinion, not fact. > No other database system has something like indirect indexes. They > have clustered indexes, but that's rather different. I don't think this is true at all. If you have a clustered index - i.e. the table is physically arranged according to the index ordering - then your secondary indexes all pretty much have to be what we're calling indirect indexes. They can hardly point to a physical identifier if rows are being moved around. I believe InnoDB works this way, and I think Oracle's index-organized tables do too. I suspect there are other examples. > > There might be some slight disagreement about whether it's useful to > > generalize TIDs from a 48-bit address space to a 64-bit address space > > without making it fully general. Like Andres, I am unconvinced that's > > meaningfully easier, and I am convinced that it's meaningfully less > > good, but other people can disagree and that's fine. I'm perfectly > > willing to change my opinion if somebody shows up with a patch that > > demonstrates the value of this approach. > > It's going to be hard if not impossible to provide empirical evidence > for the proposition that 64-bit wide TIDs (alongside 48-bit TIDs) are > the way to go. Same with any other scheme. We're talking way too much > about TIDs themselves and way too little about table AM use cases, the > way the data structures might work in new table AMs, and so on. I didn't mean that it has to be a test result showing that 64-bit TIDs outperform 56-bit TIDs or something. I just meant there has to be a reason to believe it's good, which could be based on a discussion of use cases or whatever. If we *don't* have a reason to believe it's good, we shouldn't do it. My point is that so far I am not seeing a whole lot of value of this proposed approach. For a 64-bit TID to be valuable to you, one of two things has to be true: you either don't care about having indexes that store TIDs on your new table type, or the index types you want to use can store those 64-bit TIDs. Now, I have not yet heard of anyone working on a table AM who does not want to be able to support adding btree indexes. There may be someone that I don't know about, and if so, fine. But otherwise, we need a way to store them. And that requires changing the page format for btree indexes. But surely we do not want to make all TIDs everywhere wider in future btree versions, so at least two TID widths - 6 bytes and 8 bytes - would have to be supported. And if we're at all going to do that, I think it's certainly worth asking whether supporting varlena TIDs would really be all that much harder. You seem to think it is, and you might be right, but I'm not ready to give up, because I do not see how we are ever going to get global indexes or indirect indexes without doing it, and those would be good features to have. If we can't ever get them, so be it, but you seem to kind of be saying that things like global indexes and indirect indexes are hard, and therefore they don't count as reasons why we might want variable-width TIDs. But one very large reason why those things are hard is that they require variable-width TIDs, so AFAICS this boils down to saying that we don't want the feature because it's hard to implement. 
But we should not conflate feasibility with desirability. I am quite sure that lots of people want global indexes. The number of people who want indirect indexes is in my estimation much smaller, but it's probably not zero, or else Alvaro wouldn't have tried his hand at writing a patch. Whether we can *get* those things is in doubt; whether it will happen in the near future is very much in doubt. But I at least am not in doubt about whether people want it, because I hear complaints about the lack of global indexes on an almost-daily basis. If those complaints are all from people hoping to fake me out into spending time on something that is worthless to them, my colleagues are very good actors. -- Robert Haas EDB: http://www.enterprisedb.com
Re: COPY table_name (single_column) FROM 'unknown.txt' DELIMITER E'\n'
On Wed, May 5, 2021 at 8:31 AM Joel Jacobson wrote: > Could it be an idea to exploit the fact that DELIMITER E'\n' is currently > an error? > > Why not just allow: "DELIMITER NONE" to be valid syntax meaning exactly what it says and does exactly what you desire? David J.
Re: .ready and .done files considered harmful
Greetings, * Robert Haas (robertmh...@gmail.com) wrote: > On Tue, May 4, 2021 at 11:54 AM Dilip Kumar wrote: > > I agree that if we continue to archive one file using the archive > > command then Robert's solution of checking the existence of the next > > WAL segment (N+1) has an advantage. But, currently, if you notice > > pgarch_readyXlog always considers any history file as the oldest file > > but that will not be true if we try to predict the next WAL segment > > name. For example, if we have archived 00010004 then > > next we will look for 00010005 but after generating > > segment 00010005, if there is a timeline switch then > > we will have the below files in the archive status > > (00010005.ready, 0002.history file). Now, the > > existing archiver will archive 0002.history first whereas our code > > will archive 00010005 first. That said, I don't see > > any problem with that because before archiving any segment file from > > TL 2 we will definitely archive the 0002.history file because we > > will not find the 00010006.ready and we will scan the > > full directory and now we will find 0002.history as the oldest file. > > OK, that makes sense and is good to know. I expect David will chime in on this thread too, but I did want to point out that when it comes to archiving history files you'd *really* like that to be done just about as quickly as absolutely possible, to avoid the case that we saw before that code was added, to wit: two promotions done too quickly that ended up with conflicting history and possibly conflicting WAL files trying to be archived, and ensuing madness. It's not just about making sure that we archive the history file for a timeline before archiving WAL segments along that timeline but also about making sure we get that history file into the archive as fast as we can, and archiving a 16MB WAL first would certainly delay that. Thanks, Stephen signature.asc Description: PGP signature
Re: pg_receivewal makes a bad daemon
On Wed, May 5, 2021 at 12:34 PM Magnus Hagander wrote: > Is this really a problem we should fix ourselves? Most daemon-managers > today will happily be configured to automatically restart a daemon on > failure with a single setting since a long time now. E.g. in systemd > (which most linuxen uses now) you just set Restart=on-failure (or > maybe even Restart=always) and something like RestartSec=10. > > That said, it wouldn't cover an fsync() error -- they will always > restart. The way to handle that is for the operator to capture the > error message perhaps, and just "deal with it"? Maybe, but if that's really a non-problem, why does postgres itself restart, and have facilities to write and rotate log files? I feel like this argument boils down to "a manual transmission ought to be good enough for anyone, let's not have automatics." But over the years people have found that automatics are a lot easier to drive. It may be true that if you know just how to configure your system's daemon manager, you can make all of this work, but it's not like we document how to do any of that, and it's probably not the same on every platform - Windows in particular - and, really, why should people have to do this much work? If I want to run postgres in the background I can just type 'pg_ctl start'. I could even put 'pg_ctl start' in my crontab to make sure it gets restarted within a few minutes even if the postmaster dies. If I want to keep pg_receivewal running all the time ... I need a whole pile of extra mechanism to work around its inherent fragility. Documenting how that's typically done on modern systems, as you propose further on, would be great, but I can't do it, because I don't know how to make it work. Hence the thread. > Also, all the above also apply to pg_recvlogical, right? So if we do > want to invent our own daemon-init-system, we should probably do one > more generic that can handle both. Yeah. And I'm not really 100% convinced that trying to patch this functionality into pg_receive{wal,logical} is the best way forward ... but I'm not entirely convinced that it isn't, either. I think one of the basic problems with trying to deploy PostgreSQL in 2021 is that it needs so much supporting infrastructure and so much babysitting. archive_command has to be a complicated, almost magical program we don't provide, and we don't even tell you in the documentation that you need it. If you don't want to use that, you can stream with pg_receivewal instead, but now you need a complicated daemon-runner mechanism that we don't provide or document the need for. You also probably need a connection pooler that we don't provide, a failover manager that we don't provide, and backup management software that we don't provide. And the interfaces that those tools have to work with are so awkward and primitive that even the tool authors can't always get it right. So I'm sort of unimpressed by any arguments that boil down to "what we have is good enough" or "that's the job of some other piece of software". Too many things are the job of some piece of software that doesn't really exist, or is only available on certain platforms, or that has some other problem that makes it not usable for everyone. People want to be able to download and use PostgreSQL without needing a whole library of other bits and pieces from around the Internet. -- Robert Haas EDB: http://www.enterprisedb.com
Re: MaxOffsetNumber for Table AMs
On Wed, 2021-05-05 at 10:27 -0400, Robert Haas wrote: > It's too early for the project to commit to stability in > this area; we have not managed to get a single AM apart from heapam > into core "In core" shouldn't matter. In fact, if it's in core, stability of the APIs is much less important. > If and when we have say 5 of those That seems like a standard that we won't reach in any reasonable amount of time. > we can probably > articulate some intelligent ideas about what we think the patterns > that need to hold for future AMs are, but it's reckless to > extrapolate > from 1 working example, and right now that's all we have. We should count columnar as a second example. While it doesn't support everything that heap does, we are actively working on it and it's gaining features quickly. It's also showing some impressive real-world results. Regards, Jeff Davis
Re: MaxOffsetNumber for Table AMs
On Wed, May 5, 2021 at 9:42 AM Robert Haas wrote: > On Wed, May 5, 2021 at 11:50 AM Peter Geoghegan wrote: > > I'm being very vocal here because I'm concerned that we're going about > > generalizing TIDs in the wrong way. To me it feels like there is a > > loss of perspective about what really matters. > > Well, which things matter is a question of opinion, not fact. I'm not trying to win an argument here. I am giving an opinion in the hopes that it leads to some kind of useful synthesis, based on all of our opinions. > > No other database system has something like indirect indexes. They > > have clustered indexes, but that's rather different. > > I don't think this is true at all. If you have a clustered index - > i.e. the table is physically arranged according to the index ordering > - then your secondary indexes all pretty much have to be what we're > calling indirect indexes. They can hardly point to a physical > identifier if rows are being moved around. I believe InnoDB works this > way, and I think Oracle's index-organized tables do too. I suspect > there are other examples. But these systems don't have indirect indexes *on a heap table*! Why would they ever do it that way? They already have rowid/TID as a stable identifier of logical rows, so having indirect indexes that point to a heap table's rows would be strictly worse than the generic approach for indexes on a heap table. What we call indirect indexes seem to me to be a failed attempt to solve the "TID is not a stable identifier of logical row" issue that is baked into Postgres. If I thought it was worth solving that problem then I suppose I'd solve it directly. The "indirection" of indirect indexes actually buys you nothing! It just moves some of the problem somewhere else, at the cost of even more complexity. Indirect indexes (without a clustered index) are a muddled idea. Of course I accept that clustered indexes make sense in general (though less and less these days). But the fact that these systems "use indirect indexes" for secondary indexes is precisely why clustered indexes don't seem like a great design with modern hardware! Should we invest a huge amount of work in order to have all of the disadvantages, and none of the advantages? > My point is that so far I am not seeing a whole lot of value of this > proposed approach. For a 64-bit TID to be valuable to you, one of two > things has to be true: you either don't care about having indexes that > store TIDs on your new table type, or the index types you want to use > can store those 64-bit TIDs. Now, I have not yet heard of anyone > working on a table AM who does not want to be able to support adding > btree indexes. There may be someone that I don't know about, and if > so, fine. But otherwise, we need a way to store them. And that > requires changing the page format for btree indexes. But surely we do > not want to make all TIDs everywhere wider in future btree versions, > so at least two TID widths - 6 bytes and 8 bytes - would have to be > supported. I agree that we don't want a performance/space overhead for simple cases that are quite happy with the current format. > And if we're at all going to do that, I think it's > certainly worth asking whether supporting varlena TIDs would really be > all that much harder. 
I think that global indexes are well worth having, and should be solved in some completely different way. The partition key can be an additive thing. I strongly suspect that indirect indexes (without a clustered index) are 100% useless in both theory and practice, so naturally I have little to no interest. The difficulty of supporting (say) 6 byte and 8 byte TIDs together is vastly lower than variable-width TIDs, for all kinds of reasons. See my remarks to Andres upthread about deduplication. > If we can't ever get them, so be it, but you seem to kind of be saying > that things like global indexes and indirect indexes are hard, and > therefore they don't count as reasons why we might want variable-width > TIDs. But one very large reason why those things are hard is that they > require variable-width TIDs, so AFAICS this boils down to saying that > we don't want the feature because it's hard to implement. More like very hard to implement for a very low benefit. > But we > should not conflate feasibility with desirability. I am quite sure > that lots of people want global indexes. I do too! -- Peter Geoghegan
Re: MaxOffsetNumber for Table AMs
On Wed, 2021-05-05 at 08:50 -0700, Peter Geoghegan wrote: > There just isn't that > many table AM TID designs that could ever work, and even among those > schemes that could ever work there is a pretty clear hierarchy. This > blue sky thinking about generalizing TIDs 2 years in seems *weird* to > me. I am happy to keep table AM discussions concrete, as I have plenty of concrete problems which I would like to turn into proposals. Regards, Jeff Davis
Re: COPY table_name (single_column) FROM 'unknown.txt' DELIMITER E'\n'
On 05/05/21 13:02, David G. Johnston wrote: > Why not just allow: "DELIMITER NONE" to be valid syntax meaning exactly > what it says and does exactly what you desire? What would it mean? That you get one column, multiple rows of text corresponding to "lines" delimited by something, or that you get one column, one row of text for the entire content of the file? Regards, -Chap
Re: MaxOffsetNumber for Table AMs
On Wed, May 5, 2021 at 1:13 PM Jeff Davis wrote: > "In core" shouldn't matter. In fact, if it's in core, stability of the > APIs is much less important. I don't know what to say here. I think it's unrealistic to believe that a very new API that has only 1 in-core user is going to be fully stable, or that we can know how it might evolve. I can understand why you and probably other people want that, but if somebody figures out a way to make some part of core significantly better and it requires changing that API, they're going to change the API, not give up on the idea. -- Robert Haas EDB: http://www.enterprisedb.com
Re: COPY table_name (single_column) FROM 'unknown.txt' DELIMITER E'\n'
On Wed, 5 May 2021 at 13:23, Chapman Flack wrote: > On 05/05/21 13:02, David G. Johnston wrote: > > Why not just allow: "DELIMITER NONE" to be valid syntax meaning exactly > > what it says and does exactly what you desire? > > What would it mean? That you get one column, multiple rows of text > corresponding to "lines" delimited by something, or that you get one > column, one row of text for the entire content of the file? > It means no column delimiter. In other words, there is no character which marks the end of a data value, so the entire line is a single data value. Would DELIMITER NULL make sense? The existing values are literal strings so NULL fits with that. Do we already have NONE as a keyword somewhere? It's listed in the keyword appendix to the documentation but I can't think of where it is used off the top of my head.
Re: MaxOffsetNumber for Table AMs
On Wed, May 5, 2021 at 10:33 AM Robert Haas wrote: > I don't know what to say here. I think it's unrealistic to believe > that a very new API that has only 1 in-core user is going to be fully > stable, or that we can know how it might evolve. I can understand why > you and probably other people want that, but if somebody figures out a > way to make some part of core significantly better and it requires > changing that API, they're going to change the API, not give up on the > idea. I strongly agree. More generally, we need to decide what downsides we're willing to live with. What we have right now has little chance of failing. It also has little chance of succeeding (except for something like zheap, which can presumably get by with the heapam's idea of TID). -- Peter Geoghegan
Re: MaxOffsetNumber for Table AMs
On Wed, 2021-05-05 at 10:48 -0700, Peter Geoghegan wrote: > What we have right now has little chance of failing. It also has > little chance of succeeding (except for something like zheap, which > can presumably get by with the heapam's idea of TID). What has little chance of succeeding? Table AMs? And why isn't columnar an example of something that can "get by with heapam's idea of TID"? I mean, it's not a perfect fit, but my primary complaint this whole thread is that it's undefined, not that it's completely unworkable. Regards, Jeff Davis
Re: MaxOffsetNumber for Table AMs
On Wed, May 5, 2021 at 1:15 PM Peter Geoghegan wrote: > > I don't think this is true at all. If you have a clustered index - > > i.e. the table is physically arranged according to the index ordering > > - then your secondary indexes all pretty much have to be what we're > > calling indirect indexes. They can hardly point to a physical > > identifier if rows are being moved around. I believe InnoDB works this > > way, and I think Oracle's index-organized tables do too. I suspect > > there are other examples. > > But these systems don't have indirect indexes *on a heap table*! Why > would they ever do it that way? They already have rowid/TID as a > stable identifier of logical rows, so having indirect indexes that > point to a heap table's rows would be strictly worse than the generic > approach for indexes on a heap table. One advantage of indirect indexes is that you can potentially avoid a lot of writes to the index. If a non-HOT update is performed, but the primary key is not updated, the index does not need to be touched. I think that's a potentially significant savings, even if bottom-up index deletion would have prevented the page splits. Similarly, you can mark a dead line pointer unused without having to scan the indirect index, because the index isn't pointing to that dead line pointer anyway. Hmm, but I guess you have another cleanup problem. What prevents someone from inserting a new row with the same primary key as a previously-deleted row but different values in some indirectly-indexed column? Then the old index entries, if still present, could mistakenly refer to the new row. I don't know whether Alvaro thought of that problem when he was working on this previously, or whether he solved it somehow. Possibly that's a big enough problem that the whole idea is dead in the water, but it's not obvious to me that this is so. And, anyway, this whole argument is predicated on the fact that the only table AM we have right now is heapam. If we had a table AM that organized the data by primary key value, we'd still want to be able to have secondary indexes, and they'd have to use the primary key value as the TID. > I think that global indexes are well worth having, and should be > solved some completely different way. The partition key can be an > additive thing. I agree that the partition identifier should be an additive thing, but where would we add it? It seems to me that the obvious answer is to make it a column of the index tuple. And if we can do that, why can't we put whatever kind of TID-like stuff people want in the index tuple, too? Maybe part of the problem here is that I don't actually understand how posting lists are represented... -- Robert Haas EDB: http://www.enterprisedb.com
Re: COPY table_name (single_column) FROM 'unknown.txt' DELIMITER E'\n'
On Wed, May 5, 2021 at 10:34 AM Isaac Morland wrote: > On Wed, 5 May 2021 at 13:23, Chapman Flack wrote: > >> On 05/05/21 13:02, David G. Johnston wrote: >> > Why not just allow: "DELIMITER NONE" to be valid syntax meaning exactly >> > what it says and does exactly what you desire? >> >> What would it mean? That you get one column, multiple rows of text >> corresponding to "lines" delimited by something, or that you get one >> column, one row of text for the entire content of the file? >> > > It means no column delimiter. In other words, there is no character > which marks the end of a data value, so the entire line is a single data > value. > > This. When dealing with COPY it's expected that each line becomes its own row. On the server you can do pg_read_file() if you need the entire file to be considered a single value. psql (\I and variables) is a bit more hackey, but I'd rather see that improved directly anyway if the goal is to try and make getting the "whole document" easier - copy isn't the right API for that IMO. David J.
v14 mechanical code beautification patches
It's getting to be time to think about these steps for v14:

* Renumber any manually-assigned OIDs between 8000 and 9999 to lower numbers, using renumber_oids.pl (see notes in bki.sgml)
* pgindent, perltidy, reformat-dat-files
* Update config.guess and config.sub (from https://savannah.gnu.org/projects/config)
* Update Unicode data: Edit UNICODE_VERSION and CLDR_VERSION in src/Makefile.global.in, run make update-unicode, and commit.

It looks like Peter already took care of the last two. Barring objections, I'll plan to do the first two next Wednesday or so (after the back-branch-release dust has settled). I notice that we also list this as a pre-beta task in src/tools/RELEASE_CHANGES:

* Update inet/cidr data types with newest Bind patches

However, I can't recall that anyone has ever done any such thing; and at this point, any attempt to re-sync that code would likely be a rather major task. Should we take that off the checklist? regards, tom lane
Re: MaxOffsetNumber for Table AMs
Hi, On 2021-05-05 13:32:57 -0400, Robert Haas wrote: > I don't know what to say here. I think it's unrealistic to believe > that a very new API that has only 1 in-core user is going to be fully > stable, or that we can know how it might evolve. I can understand why > you and probably other people want that, but if somebody figures out a > way to make some part of core significantly better and it requires > changing that API, they're going to change the API, not give up on the > idea. Yea. I think it would be actively *bad* if tableam were too stable. tableam is at best an 80% solution to the abstraction needs (those 80% were pretty painful to achieve already, I don't think we could have gotten much more initially). If we get cornered into not evolving the API because of 2-3 external users, we're a) going to live with a leaky abstraction for much longer b) getting more hesitant to work incrementally. Both would be bad. Greetings, Andres Freund
Re: MaxOffsetNumber for Table AMs
On Wed, May 5, 2021 at 10:57 AM Robert Haas wrote: > One advantage of indirect indexes is that you can potentially avoid a > lot of writes to the index. If a non-HOT update is performed, but the > primary key is not updated, the index does not need to be touched. I > think that's a potentially significant savings, even if bottom-up > index deletion would have prevented the page splits. Similarly, you > can mark a dead line pointer unused without having to scan the > indirect index, because the index isn't pointing to that dead line > pointer anyway. As I said, this is equivalent to solving the "TID is a stable identifier of logical row" issue (an exceptionally hard problem that I don't think is worth solving), except that you make the secondary indexes have potentially larger keys for no benefit. Sure, you can consistently refer to a logical row using its PK value (assuming you have this whole two-phase locking infrastructure), but why wouldn't you "just" solve the problem with TID directly instead? What does involving PK values actually buy you? I am pretty sure that the answer is "less than nothing". It is still true that I'm arguing against ever having a clustered index table AM, which would be somewhat useful to users (that much I'll own). The main reason for that is because we'd still be required to solve the "TID is a stable identifier of logical row" issue, except it's not a physiological TID/rowid (it's a fully logical row identifier). So everything seems to lead back to that hard problem seeming insoluble. > Hmm, but I guess you have another cleanup problem. What prevents > someone from inserting a new row with the same primary key as a > previously-deleted row but different values in some indirectly-indexed > column? Two-phase locking in indexes stops it. Note that this is pretty much what happens in Oracle -- it's not just SQL Server. This is why we have rich extensibility indexing -- indexes are strictly physical data structures in Postgres. > And, anyway, this whole argument is predicated on the fact that the > only table AM we have right now is heapam. If we had a table AM that > organized the data by primary key value, we'd still want to be able to > have secondary indexes, and they'd have to use the primary key value > as the TID. But Jeff has a design for the columnstore table AM where TIDs are essentially logical (not physiological like those of heapam), that nevertheless will work with the design around TIDs that I have in mind. "Logical identifiers" versus "Logical identifiers that stably identify logical rows" seems like a subtle but important distinction here. Of course I cannot yet rule out the possibility that this approach to TIDs will always be good enough. But it sure seems like it might be, and starting with the assumption that it is and working backwards seems like a good way to attack the problem as a practical matter. > > I think that global indexes are well worth having, and should be > > solved some completely different way. The partition key can be an > > additive thing. > > I agree that the partition identifier should be an additive thing, but > where would we add it? It seems to me that the obvious answer is to > make it a column of the index tuple. Right. > And if we can do that, why can't > we put whatever kind of TID-like stuff people want in the index tuple, > too? Maybe part of the problem here is that I don't actually > understand how posting lists are represented... 
You want to use the partition identifier for predicate push-down and stuff anyway, so making it part of the TID doesn't seem particularly natural to me. "Posting list splits" from the nbtree README will give you some idea of why I care about making TIDs integer-like and equi-sized within any given index tuple. There will be similar considerations for GIN. Though I think that nbtree deduplication is important enough on its own to try to preserve across table AMs. -- Peter Geoghegan
Re: MaxOffsetNumber for Table AMs
Hi, On 2021-05-05 10:56:56 -0700, Jeff Davis wrote: > On Wed, 2021-05-05 at 10:48 -0700, Peter Geoghegan wrote: > > What we have right now has little chance of failing. It also has > > little chance of succeeding (except for something like zheap, which > > can presumably get by with the heapam's idea of TID). > > What has little chance of succeeding? Table AMs? > > And why isn't columnar an example of something that can "get by with > heapam's idea of TID"? I mean, it's not a perfect fit, but my primary > complaint this whole thread is that it's undefined, not that it's > completely unworkable. Agreed. And we can increase the fit a good bit without needing invasive all-over changes. With those changes often even helping heap. E.g. tidbitmap.c's hardcoded use of MaxHeapTuplesPerPage is a problem even for heap - we waste a lot of space that's not commonly used. A better data structure would help here (a radix tree, I'd say, but several tree-shaped approaches seem possible). Greetings, Andres Freund
Re: RFC: Detailed reorder buffer stats dumps
Hi, On 2021-05-05 18:33:27 +0800, Craig Ringer wrote: > I'm thinking of piggy-backing on the approach used in the "Get memory > contexts of an arbitrary backend process" patch in order to provide access > to detailed reorder buffer content statistics from walsenders on request. > > Right now the reorder buffer is mostly a black-box. I mostly rely on gdb or > on dynamic probes (perf, systemtap) to see what it's doing. I intend a > patch soon to add a couple of fields to struct WalSnd to report some very > coarse reorder buffer stats - at least oldest buffered xid, number of > buffered txns, total bytes of buffered txns in memory, total bytes of > buffered txns spilled to disk. > > But sometimes what I really want is details on the txns that're in the > reorder buffer, and that's not feasible to export via always-enabled > reporting like struct WalSnd. So I'm thinking that the same approach used > for the memory context stats patch might work well for asking the walsender > for a detailed dump of reorder buffer contents. Something like a > per-buffered-txn table of txn topxid, start-lsn, most recent change lsn, > number of changes, number of subxids, number of invalidations, number of > catalog changes, buffer size in memory, buffer size spilled to disk. > > Anyone drastically opposed to the idea? I am doubtful. The likelihood of ending with effectively unused code seems very substantial here. Greetings, Andres Freund
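For reference, the coarse slot-level spill statistics that already exist can be queried like this (a minimal sketch, assuming the pg_stat_replication_slots view added during the v14 cycle); what Craig is describing is per-transaction detail beyond this level:

    -- Per-slot totals for reorder-buffer transactions spilled to disk.
    SELECT slot_name, spill_txns, spill_count, spill_bytes
    FROM pg_stat_replication_slots;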
Re: COPY table_name (single_column) FROM 'unknown.txt' DELIMITER E'\n'
On Wed, May 5, 2021, at 19:58, David G. Johnston wrote: > On Wed, May 5, 2021 at 10:34 AM Isaac Morland wrote: >> On Wed, 5 May 2021 at 13:23, Chapman Flack wrote: >>> On 05/05/21 13:02, David G. Johnston wrote: >>> > Why not just allow: "DELIMITER NONE" to be valid syntax meaning exactly >>> > what it says and does exactly what you desire? >>> >>> What would it mean? That you get one column, multiple rows of text >>> corresponding to "lines" delimited by something, or that you get one >>> column, one row of text for the entire content of the file? >> >> It means no column delimiter. In other words, there is no character which >> marks the end of a data value, so the entire line is a single data value. >> > > This. When dealing with COPY it's expected that each line becomes its own > row. On the server you can do pg_read_file() if you need the entire file to > be considered a single value. psql (\I and variables) is a bit more hackey, > but I'd rather see that improved directly anyway if the goal is to try and > make getting the "whole document" easier - copy isn't the right API for that > IMO. I think you misunderstood the problem. I don't want the entire file to be considered a single value. I want each line to become its own row, just a row with a single column. So I actually think COPY seems like a perfect match for the job, since it does precisely that, except there is no delimiter in this case. I'm currently using the pg_read_file()-hack in a project, and even though it can read files up to 1GB, using e.g. regexp_split_to_table() to split on E'\n' seems to need 4x as much memory, so it only works with files less than ~256MB. SELECT COUNT(*) FROM regexp_split_to_table(repeat(E'\n',10),E'\n'); ERROR: invalid memory alloc request size 44 Time: 4151.374 ms (00:04.151) /Joel
Re: MaxOffsetNumber for Table AMs
On Wed, May 5, 2021 at 10:56 AM Jeff Davis wrote: > What has little chance of succeeding? Table AMs? > > And why isn't columnar an example of someting that can "get by with > heapam's idea of TID"? I mean, it's not a perfect fit, but my primary > complaint this whole thread is that it's undefined, not that it's > completely unworkable. I think that it could be fairly workable with moderate effort (maybe even without that effort, but that doesn't seem so appealing). To do it well we have to actually generalize TIDs sensibly. And I think that that means admitting that we'll never solve the "TID is a stable identifier of a logical row, not a physical version" problem. ISTM that that's the problem that is at the root of everything here. -- Peter Geoghegan
Re: COPY table_name (single_column) FROM 'unknown.txt' DELIMITER E'\n'
On Wed, May 5, 2021, at 19:34, Isaac Morland wrote: > Would DELIMITER NULL make sense? The existing values are literal strings so > NULL fits with that. Do we already have NONE as a keyword somewhere? It's > listed in the keyword appendix to the documentation but I can't think of > where it is used off the top of my head. +1 to using some keyword. NULL or NONE seems fine to me. Or maybe WITHOUT DELIMITER? /Joel
Re: MaxOffsetNumber for Table AMs
On Wed, May 5, 2021 at 11:25 AM Andres Freund wrote: > Agreed. And we can increase the fit a good bit without needing invasive > all-over changes. With those changes often even helping heap. > > E.g. tidbitmap.c's harcoded use of MaxHeapTuplesPerPage is a problem > even for heap - we waste a lot of space that's not commonly used. A > better datastructure (radix tree like I'd say, but several tree shaped > approaches seem possible). Agreed -- even if we only cared about heapam we still ought to do something about tidbitmap.c's use of MaxHeapTuplesPerPage. -- Peter Geoghegan
Re: Re: COPY table_name (single_column) FROM 'unknown.txt' DELIMITER E'\n'
"Joel Jacobson" writes: > I think you misunderstood the problem. > I don't want the entire file to be considered a single value. > I want each line to become its own row, just a row with a single column. > So I actually think COPY seems like a perfect match for the job, > since it does precisely that, except there is no delimiter in this case. Well, there's more to it than just the column delimiter. * What about \N being converted to NULL? * What about \. being treated as EOF? * Do you want to turn off the special behavior of backslash (ESCAPE) altogether? * What about newline conversions (\r\n being seen as just \n, etc)? I'm inclined to think that "use pg_read_file and then split at newlines" might be a saner answer than delving into all these fine points. Not least because people yell when you add cycles to the COPY inner loops. > I'm currently using the pg_read_file()-hack in a project, > and even though it can read files up to 1GB, > using e.g. regexp_split_to_table() to split on E'\n' > seems to need 4x as much memory, so it only > works with files less than ~256MB. Yeah, that's because of the conversion to "chr". But a regexp is overkill for that anyway. Don't we have something that will split on simple substring matches? regards, tom lane
Re: MaxOffsetNumber for Table AMs
On Wed, 2021-05-05 at 11:22 -0700, Andres Freund wrote: > Yea. I think it would be actively *bad* if tableam were too > stable. tableam is at best an 80% solution to the abstraction needs > (those 80% were pretty painful to achieve already, I don't think we > could have gotten much more initially). If we get cornered into not > evolving the API because of 2-3 external users, we're a) going to > live > with a leaky abstraction for much longer b) getting more hesitant to > work incrementally. Both would be bad. Like anything, we make the decision at the time we have a reason to break something. But why are extensions disfavored in this calculation vs. in-core? Isn't it a lot easier to update in-core code to new APIs? Evolving the API is one thing, but we should be more careful about things that could affect on-disk state like ItemPointer representations. By "more careful", I don't mean that we reject all proposals; I mean that we don't casually impose new limits in other parts of the system that happen to work for heapam but will cause table AM extensions to break. Regards, Jeff Davis
Re: COPY table_name (single_column) FROM 'unknown.txt' DELIMITER E'\n'
On 5/5/21 2:45 PM, Tom Lane wrote: > "Joel Jacobson" writes: >> I think you misunderstood the problem. >> I don't want the entire file to be considered a single value. >> I want each line to become its own row, just a row with a single column. >> So I actually think COPY seems like a perfect match for the job, >> since it does precisely that, except there is no delimiter in this case. > Well, there's more to it than just the column delimiter. > > * What about \N being converted to NULL? > * What about \. being treated as EOF? > * Do you want to turn off the special behavior of backslash (ESCAPE) > altogether? > * What about newline conversions (\r\n being seen as just \n, etc)? > > I'm inclined to think that "use pg_read_file and then split at newlines" > might be a saner answer than delving into all these fine points. > Not least because people yell when you add cycles to the COPY > inner loops. +1 Also we have generally been resistant to supporting odd formats. FDWs can help here (e.g. file_text_array), but they can't use STDIN IIRC. > >> I'm currently using the pg_read_file()-hack in a project, >> and even though it can read files up to 1GB, >> using e.g. regexp_split_to_table() to split on E'\n' >> seems to need 4x as much memory, so it only >> works with files less than ~256MB. > Yeah, that's because of the conversion to "chr". But a regexp > is overkill for that anyway. Don't we have something that will > split on simple substring matches? > > Not that I know of. There is split_part but I don't think that's fit for purpose here. Do we need one, or have I missed something? cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Re: MaxOffsetNumber for Table AMs
On Wed, May 5, 2021 at 12:09 PM Jeff Davis wrote: > Like anything, we make the decision at the time we have a reason to > break something. But why are are exensions disfavored in this > calculation vs. in-core? Isn't it a lot easier to update in-core code > to new APIs? We don't really have an API for how TIDs behave (unless you happen to want to emulate heapam, which is reasonable and was expected). It's unspecified because nobody knows what it is (or what it should be) just yet. AFAICT there is no TID API to break. -- Peter Geoghegan
Re: Re: COPY table_name (single_column) FROM 'unknown.txt' DELIMITER E'\n'
On Wed, May 05, 2021 at 02:45:41PM -0400, Tom Lane wrote: > > I'm currently using the pg_read_file()-hack in a project, > > and even though it can read files up to 1GB, > > using e.g. regexp_split_to_table() to split on E'\n' > > seems to need 4x as much memory, so it only > > works with files less than ~256MB. > > Yeah, that's because of the conversion to "chr". But a regexp > is overkill for that anyway. Don't we have something that will > split on simple substring matches? For v14 commit 66f163068030b5c5fe792a0daee27822dac43791 Author: Tom Lane Date: Wed Sep 2 18:23:56 2020 -0400 Add string_to_table() function. -- Justin
Re: MaxOffsetNumber for Table AMs
On Wed, 5 May 2021 at 19:15, Peter Geoghegan wrote: > > On Wed, May 5, 2021 at 9:42 AM Robert Haas wrote: > > On Wed, May 5, 2021 at 11:50 AM Peter Geoghegan wrote: > > > I'm being very vocal here because I'm concerned that we're going about > > > generalizing TIDs in the wrong way. To me it feels like there is a > > > loss of perspective about what really matters. > > > > Well, which things matter is a question of opinion, not fact. > > I'm not trying to win an argument here. I am giving an opinion in the > hopes that it leads to some kind of useful synthesis, based on all of > our opinions. > > > > No other database system has something like indirect indexes. They > > > have clustered indexes, but that's rather different. > > > > I don't think this is true at all. If you have a clustered index - > > i.e. the table is physically arranged according to the index ordering > > - then your secondary indexes all pretty much have to be what we're > > calling indirect indexes. They can hardly point to a physical > > identifier if rows are being moved around. I believe InnoDB works this > > way, and I think Oracle's index-organized tables do too. I suspect > > there are other examples. > > But these systems don't have indirect indexes *on a heap table*! Why > would they ever do it that way? They already have rowid/TID as a > stable identifier of logical rows, so having indirect indexes that > point to a heap table's rows would be strictly worse than the generic > approach for indexes on a heap table. > > What we call indirect indexes seem to me to be a failed attempt to > solve the "TID is not a stable identifier of logical row" issue that > is baked-in to Postgres. If I thought it was worth solving that > problem then I suppose I'd solve it directly. The "indirection" of > indirect indexes actuallys buys you nothing! It just moves some of the > problem somewhere else, at the cost of even more complexity. Indirect > indexes (without a clustered index) are a muddled idea. > > Of course I accept that clustered indexes make sense in general > (though less and less these days). But the fact that these systems > "use indirect indexes" for secondary indexes is precisely why > clustered indexes don't seem like a great design with modern hardware! > Should we invest a huge amount of work in order to have all of the > disadvantages, and none of the advantages? > > > My point is that so far I am not seeing a whole lot of value of this > > proposed approach. For a 64-bit TID to be valuable to you, one of two > > things has to be true: you either don't care about having indexes that > > store TIDs on your new table type, or the index types you want to use > > can store those 64-bit TIDs. Now, I have not yet heard of anyone > > working on a table AM who does not want to be able to support adding > > btree indexes. There may be someone that I don't know about, and if > > so, fine. But otherwise, we need a way to store them. And that > > requires changing the page format for btree indexes. But surely we do > > not want to make all TIDs everywhere wider in future btree versions, > > so at least two TID widths - 6 bytes and 8 bytes - would have to be > > supported. > > I agree that we don't want a performance/space overhead for simple > cases that are quite happy with the current format. > > > And if we're at all going to do that, I think it's > > certainly worth asking whether supporting varlena TIDs would really be > > all that much harder. 
You seem to think it is, and you might be right, > > but I'm not ready to give up, because I do not see how we are ever > > going to get global indexes or indirect indexes without doing it, and > > those would be good features to have. > > I think that global indexes are well worth having, and should be > solved some completely different way. The partition key can be an > additive thing. I believe that it cannot be "just" an additive thing, at least not through a normal INCLUDEd column, as you'd get duplicate TIDs in the index, with its related problems. You also cannot add it as a key column, as this would disable UNIQUE indexes; one of the largest use cases of global indexes. So, you must create specialized infrastructure for this identifier. And when we're already adding specialized infrastructure, then this should probably be part of a new TID infrastructure. And if we're going to change TID infrastructure to allow for more sizes (as we'd need normal TableAM TIDs, and global index partition-identifying TIDs), I'd argue that it should not be too much more difficult to create an infrastructure for 'new TID' in which the table AM supplies type, size and strict ordering information for these 'new TID's. And if this 'new TID' size is not going to be defined by the index AM but by the indexed object (be it a table or a 'global' or whatever we'll build indexes on), I see no reason why this 'new TID' infrastructure couldn't eventually support variable length TIDs; or constant sized usertyp
Dubious assertion in RegisterDynamicBackgroundWorker
I noticed this recent crash on lorikeet: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lorikeet&dt=2021-05-05%2009%3A19%3A29 The relevant bits of the log seem to be 2021-05-05 05:36:22.011 EDT [60926716.bb24:1] ERROR: could not map dynamic shared memory segment ... 2021-05-05 05:36:22.013 EDT [609266c5.b793:4] LOG: background worker "parallel worker" (PID 47908) exited with exit code 1 ... TRAP: FailedAssertion("BackgroundWorkerData->parallel_register_count - BackgroundWorkerData->parallel_terminate_count <= MAX_PARALLEL_WORKER_LIMIT", File: "/home/andrew/bf/root/REL_13_STABLE/pgsql.build/../pgsql/src/backend/postmaster/bgworker.c", Line: 1016) *** starting debugger for pid 47743, tid 1264 2021-05-05 05:36:26.629 EDT [609266c5.b793:5] LOG: server process (PID 47743) exited with exit code 127 So we had a parallel worker fail to start, whereupon its leader went down with an assertion failure. I know that the parallel-worker code is held together with chewing gum and baling wire, but that's a bit much. Looking at the indicated code, we find /* * If this is a parallel worker, check whether there are already too many * parallel workers; if so, don't register another one. Our view of * parallel_terminate_count may be slightly stale, but that doesn't really * matter: we would have gotten the same result if we'd arrived here * slightly earlier anyway. There's no help for it, either, since the * postmaster must not take locks; a memory barrier wouldn't guarantee * anything useful. */ if (parallel && (BackgroundWorkerData->parallel_register_count - BackgroundWorkerData->parallel_terminate_count) >= max_parallel_workers) { Assert(BackgroundWorkerData->parallel_register_count - BackgroundWorkerData->parallel_terminate_count <= MAX_PARALLEL_WORKER_LIMIT); LWLockRelease(BackgroundWorkerLock); return false; } I would like to know on what grounds that Assert isn't insanity? We just finished pointing out that we might see an old parallel_terminate_count, which ISTM says specifically that parallel_register_count minus parallel_terminate_count might be larger than expected. Admittedly, it seems unlikely that the difference could exceed MAX_PARALLEL_WORKER_LIMIT = 1024 in a regression test run where the limit on number of parallel workers is only 8. What I think is more likely, given that these counters are unsigned, is that the difference was actually negative. Which could be a bug, or it could be an expectable race condition, or it could just be some flakiness on lorikeet's part (that machine has had a lot of issues lately). I trawled the buildfarm logs going back 180 days, and found no other instances of this assertion, which seems to be evidence in favor of the "lorikeet got flaky" theory. But it's not proof. In any case, I see zero value in this assertion, so I propose we remove it. If we don't remove it, it needs serious revision, because it seems absolutely obvious to me that it could trigger when there is nothing wrong. A system pushing the limit of number of parallel workers would be at considerable risk. regards, tom lane
Re: COPY table_name (single_column) FROM 'unknown.txt' DELIMITER E'\n'
On 5/5/21 3:36 PM, Justin Pryzby wrote: > On Wed, May 05, 2021 at 02:45:41PM -0400, Tom Lane wrote: >>> I'm currently using the pg_read_file()-hack in a project, >>> and even though it can read files up to 1GB, >>> using e.g. regexp_split_to_table() to split on E'\n' >>> seems to need 4x as much memory, so it only >>> works with files less than ~256MB. >> Yeah, that's because of the conversion to "chr". But a regexp >> is overkill for that anyway. Don't we have something that will >> split on simple substring matches? > For v14 > > commit 66f163068030b5c5fe792a0daee27822dac43791 > Author: Tom Lane > Date: Wed Sep 2 18:23:56 2020 -0400 > > Add string_to_table() function. > Ha! just in time :-) cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Re: COPY table_name (single_column) FROM 'unknown.txt' DELIMITER E'\n'
Andrew Dunstan writes: > On 5/5/21 2:45 PM, Tom Lane wrote: >> Yeah, that's because of the conversion to "chr". But a regexp >> is overkill for that anyway. Don't we have something that will >> split on simple substring matches? > Not that I know of. There is split_part but I don't think that's fit for > purpose here. Do we need one, or have I missed something? [ checks manual ... ] string_to_array or string_to_table would do, I think. regards, tom lane
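A minimal sketch of that workaround, assuming a file the server can read and the v14 string_to_table() function mentioned upthread (table and file names here are only illustrative):

    -- Read the whole file server-side, then split it into one row per line.
    CREATE TABLE raw_lines (line text);

    INSERT INTO raw_lines (line)
    SELECT l
    FROM string_to_table(pg_read_file('unknown.txt'), E'\n') AS l;

Note that a trailing newline in the file will produce one final empty row, and none of COPY's escape or \N handling applies; whether that is a feature or a problem depends on the use case, which is really the point Tom is making above.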
Re: .ready and .done files considered harmful
On Wed, May 5, 2021 at 1:06 PM Stephen Frost wrote: > It's not just about making sure that we archive the history file for a > timeline before archiving WAL segments along that timeline but also > about making sure we get that history file into the archive as fast as > we can, and archiving a 16MB WAL first would certainly delay that. Ooph. That's a rather tough constraint. Could we get around it by introducing some kind of signalling mechanism, perhaps? Like if there's a new history file, that must mean the server has switched timelines -- I think, anyway -- so if we notified the archiver every time there was a timeline switch it could react accordingly. -- Robert Haas EDB: http://www.enterprisedb.com
Re: MaxOffsetNumber for Table AMs
On Wed, May 5, 2021 at 12:43 PM Matthias van de Meent wrote: > I believe that it cannot be "just" an additive thing, at least not > through a normal INCLUDEd column, as you'd get duplicate TIDs in the > index, with its related problems. You also cannot add it as a key > column, as this would disable UNIQUE indexes; one of the largest use > cases of global indexes. So, you must create specialized > infrastructure for this identifier. You're just quibbling about the precise words that I used. Of course it is true that there must be some sense in which a global index partition key attribute will need to be special to the implementation -- how else could a global index enforce uniqueness? That was clearly implied. > And when we're already adding specialized infrastructure, then this > should probably be part of a new TID infrastructure. This is a non-sequitur. > And if we're going to change TID infrastructure to allow for more > sizes (as we'd need normal TableAM TIDs, and global index > partition-identifying TIDs), I'd argue that it should not be too much > more difficult to create an infrastructure for 'new TID' in which the > table AM supplies type, size and strict ordering information for these > 'new TID's. > > And if this 'new TID' size is not going to be defined by the index AM > but by the indexed object (be it a table or a 'global' or whatever > we'll build indexes on), I see no reason why this 'new TID' > infrastructure couldn't eventually support variable length TIDs; or > constant sized usertype TIDs (e.g. the 3 int columns of the primary > key of a clustered table). You're not considering the big picture. It's not self-evident that anybody will ever have much use for a variable-width TID in their table AM, at least beyond some fairly simple scheme -- because of the fundamental issue of TID not working as a stable identifier of logical rows in Postgres. If it was very clear that there would be *some* significant benefit then the costs might start to look reasonable. But there isn't. "Build it and they will come" is not at all convincing to me. -- Peter Geoghegan
Re: .ready and .done files considered harmful
Greetings, * Robert Haas (robertmh...@gmail.com) wrote: > On Wed, May 5, 2021 at 1:06 PM Stephen Frost wrote: > > It's not just about making sure that we archive the history file for a > > timeline before archiving WAL segments along that timeline but also > > about making sure we get that history file into the archive as fast as > > we can, and archiving a 16MB WAL first would certainly delay that. > > Ooph. That's a rather tough constraint. Could we get around it by > introducing some kind of signalling mechanism, perhaps? Like if > there's a new history file, that must mean the server has switched > timelines -- I think, anyway -- so if we notified the archiver every > time there was a timeline switch it could react accordingly. I would think something like that would be alright and not worse than what we've got now. That said, in an ideal world, we'd have a way to get the new timeline to switch to in a way that doesn't leave open race conditions, so as long we're talking about big changes to the way archiving and archive_command work (or about throwing out the horrible idea that is archive_command in the first place and replacing it with appropriate hooks such that someone could install an extension which would handle archiving...), I would hope we'd have a way of saying "please, atomically, go get me a new timeline." Just as a reminder for those following along at home, as I'm sure you're already aware, the way we figure out what timeline to switch to when a replica is getting promoted is that we go run the restore command asking for history files until we get back "nope, there is no file named 123.history", and then we switch to that timeline and then try to push such a history file into the repo and hope that it works. Thanks, Stephen signature.asc Description: PGP signature
Re: .ready and .done files considered harmful
On Wed, May 5, 2021 at 4:13 PM Stephen Frost wrote: > I would think something like that would be alright and not worse than > what we've got now. OK. > That said, in an ideal world, we'd have a way to get the new timeline to > switch to in a way that doesn't leave open race conditions, so as long > we're talking about big changes to the way archiving and archive_command > work (or about throwing out the horrible idea that is archive_command in > the first place and replacing it with appropriate hooks such that > someone could install an extension which would handle archiving...), I > would hope we'd have a way of saying "please, atomically, go get me a new > timeline." > > Just as a reminder for those following along at home, as I'm sure you're > already aware, the way we figure out what timeline to switch to when a > replica is getting promoted is that we go run the restore command asking > for history files until we get back "nope, there is no file named > 123.history", and then we switch to that timeline and then try to > push such a history file into the repo and hope that it works. Huh, I had not thought about that problem. So, at the risk of getting sidetracked, what exactly are you asking for here? Let the extension pick the timeline using an algorithm of its own devising, rather than having core do it? Or what? -- Robert Haas EDB: http://www.enterprisedb.com
Re: .ready and .done files considered harmful
Hi, On 2021-05-05 16:13:08 -0400, Stephen Frost wrote: > Just as a reminder for those following along at home, as I'm sure you're > already aware, the way we figure out what timeline to switch to when a > replica is getting promoted is that we go run the restore command asking > for history files until we get back "nope, there is no file named > 123.history", and then we switch to that timeline and then try to > push such a history file into the repo and hope that it works. Which is why the whole concept of timelines as we have them right now is pretty much useless. It is fundamentally impossible to guarantee unique timeline ids in all cases if they are assigned sequentially at timeline creation - consider needing to promote a node on both ends of a split network. I'm quite doubtful that pretending to tackle this problem via archiving order is a good idea, given the fundamentally racy nature. Greetings, Andres Freund
Re: .ready and .done files considered harmful
Hi, On 2021-05-05 16:22:21 -0400, Robert Haas wrote: > Huh, I had not thought about that problem. So, at the risk of getting > sidetracked, what exactly are you asking for here? Let the extension > pick the timeline using an algorithm of its own devising, rather than > having core do it? Or what? Not Stephen, but to me the most reasonable way to address this is to make timeline identifier wider and randomly allocated. The sequential looking natures of timelines imo is actively unhelpful. Greetings, Andres Freund
Re: .ready and .done files considered harmful
Greetings, * Robert Haas (robertmh...@gmail.com) wrote: > On Wed, May 5, 2021 at 4:13 PM Stephen Frost wrote: > > That said, in an ideal world, we'd have a way to get the new timeline to > > switch to in a way that doesn't leave open race conditions, so as long > > we're talking about big changes to the way archiving and archive_command > > work (or about throwing out the horrible idea that is archive_command in > > the first place and replacing it with appropriate hooks such that > > someone could install an extension which would handle archiving...), I > > would hope we'd have a way of saying "please, atomically, go get me a new > > timeline." > > > > Just as a reminder for those following along at home, as I'm sure you're > > already aware, the way we figure out what timeline to switch to when a > > replica is getting promoted is that we go run the restore command asking > > for history files until we get back "nope, there is no file named > > 123.history", and then we switch to that timeline and then try to > > push such a history file into the repo and hope that it works. > > Huh, I had not thought about that problem. So, at the risk of getting > sidetracked, what exactly are you asking for here? Let the extension > pick the timeline using an algorithm of its own devising, rather than > having core do it? Or what? Having the extension do it somehow is an interesting idea and one which might be kind of cool. The first thought I had was to make it archive_command's job to "pick" the timeline by just re-trying to push the .history file (the actual contents of it don't change, as the information in the file is about the timeline we are switching *from* and at what LSN). That requires an archive command which will fail if that file already exists though and, ideally, would perform the file archival in an atomic fashion (though this last bit isn't stricly necessary- anything along these lines would certainly be better than the current state). Having an entirely independent command/hook that's explicitly for this case would be another approach, of course, either in a manner that allows the extension to pick the destination timeline or is defined to be "return success only if the file is successfully archived, but do *not* overwrite any existing file of the same name and return an error instead." and then the same approach as outlined above. Thanks, Stephen signature.asc Description: PGP signature
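The "fail rather than overwrite" behavior Stephen describes can be made atomic on a plain-filesystem archive with O_EXCL; the following is a minimal hedged sketch (function name and error handling invented for illustration, and assuming a local directory rather than a remote repository):

#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <unistd.h>

/*
 * Sketch only: try to claim "<tli>.history" in a local archive directory.
 * O_EXCL guarantees that if two clusters race to archive the same history
 * file, at most one of them can succeed.
 */
static bool
claim_history_file(const char *history_path)
{
    int         fd = open(history_path, O_WRONLY | O_CREAT | O_EXCL, 0600);

    if (fd < 0)
    {
        /* errno == EEXIST means the timeline is already claimed; other errors also fail here */
        return false;
    }

    /* a real implementation would write the contents, fsync, and fsync the directory */
    close(fd);
    return true;
}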
Re: Dubious assertion in RegisterDynamicBackgroundWorker
On Wed, May 5, 2021 at 3:46 PM Tom Lane wrote: > Admittedly, it seems unlikely that the difference could exceed > MAX_PARALLEL_WORKER_LIMIT = 1024 in a regression test run where > the limit on number of parallel workers is only 8. What I think is > more likely, given that these counters are unsigned, is that the > difference was actually negative. Which could be a bug, or it could > be an expectable race condition, or it could just be some flakiness > on lorikeet's part (that machine has had a lot of issues lately). I think that assertion was added by me, and I think the thought process was that the value shouldn't go negative and that if it does it's probably a bug which we might want to fix. But since the values are unsigned I could hardly check for < 0, so I did it this way instead. But since there's no memory barrier between the two loads, I guess there's no guarantee that they have the expected relationship, even if there is a memory barrier on the store side. I wonder if it's worth trying to tighten that up so that the assertion is more meaningful, or just give up and rip it out. I'm afraid that if we do have (or develop) bugs in this area, someone will discover that the effective max_parallel_workers value on their system slowly drifts up or down from the configured value, and we'll have no clue where things are going wrong. The assertion was intended to give us a chance of noticing that sort of problem in the buildfarm or on a developer's machine before the code gets out into the real world. -- Robert Haas EDB: http://www.enterprisedb.com
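The check in question is, paraphrased (not verbatim from bgworker.c), shown below; because both counters are uint32, a "negative" difference wraps to a value near UINT32_MAX, which is why the assertion is written as an upper bound rather than a >= 0 test.

/* Paraphrase of the accounting check discussed above; not the actual source. */
static void
sketch_check_parallel_worker_accounting(void)
{
    uint32      registered = BackgroundWorkerData->parallel_register_count;
    uint32      terminated = BackgroundWorkerData->parallel_terminate_count;

    /*
     * If terminated ever exceeds registered, the unsigned subtraction wraps
     * around and easily exceeds MAX_PARALLEL_WORKER_LIMIT (1024), so this
     * catches both "negative" differences and genuine runaway leaks -- but,
     * as noted, nothing orders the two loads relative to the updaters.
     */
    Assert(registered - terminated <= MAX_PARALLEL_WORKER_LIMIT);
}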
Re: .ready and .done files considered harmful
On Wed, May 5, 2021 at 4:31 PM Andres Freund wrote: > On 2021-05-05 16:22:21 -0400, Robert Haas wrote: > > Huh, I had not thought about that problem. So, at the risk of getting > > sidetracked, what exactly are you asking for here? Let the extension > > pick the timeline using an algorithm of its own devising, rather than > > having core do it? Or what? > > Not Stephen, but to me the most reasonable way to address this is to > make timeline identifier wider and randomly allocated. The sequential > looking natures of timelines imo is actively unhelpful. Yeah, I always wondered why we didn't assign them randomly. -- Robert Haas EDB: http://www.enterprisedb.com
Re: .ready and .done files considered harmful
Greetings, * Robert Haas (robertmh...@gmail.com) wrote: > On Wed, May 5, 2021 at 4:31 PM Andres Freund wrote: > > On 2021-05-05 16:22:21 -0400, Robert Haas wrote: > > > Huh, I had not thought about that problem. So, at the risk of getting > > > sidetracked, what exactly are you asking for here? Let the extension > > > pick the timeline using an algorithm of its own devising, rather than > > > having core do it? Or what? > > > > Not Stephen, but to me the most reasonable way to address this is to > > make timeline identifier wider and randomly allocated. The sequential > > looking natures of timelines imo is actively unhelpful. > > Yeah, I always wondered why we didn't assign them randomly. Based on what we do today regarding the info we put into .history files, trying to figure out which is the "latest" timeline might be a bit tricky with randomly selected timelines. Maybe we could find a way to solve that though. I do note that this comment in timeline.c is, ahem, perhaps over-stating things a bit: * Note: while this is somewhat heuristic, it does positively guarantee * that (result + 1) is not a known timeline, and therefore it should * be safe to assign that ID to a new timeline. Thanks, Stephen signature.asc Description: PGP signature
Re: Bogus collation version recording in recordMultipleDependencies
On 4/21/21 4:28 PM, Andres Freund wrote: > Hi, > > On 2021-04-20 12:05:27 +1200, Thomas Munro wrote: >> I'll hold off reverting for a few more days to see if anyone has any >> other thoughts on that, because there doesn't seem to be any advantage >> in being too hasty about it. > I'm not really convinced that this is warranted, and that it isn't > better addressed by reducing the scope of the feature: > > When using index collation versions to decide whether to reindex > individual indexes it is important to not have any false negatives - > otherwise the feature could trigger corruption. > > However, the feature has a second, IMO more crucial, aspect: Preventing > silent corruption due to collation changes. There are regular reports of > people corrupting their indexes (and subsequently constraints) due to > collation changes (or collation differences between primary/replica). > To be effective detecting such cases it is not required to catch 100% of > all dangerous cases, just that a high fraction of cases is caught. > > And handling the composite type case doesn't seem like it'd impact the > percentage of detected collation issues all that much. For one, indexes > on composite types aren't all that common, and additing new columns to > those composite types is likely even rarer. For another, I'd expect that > nearly all databases that have indexes on composite types also have > indexes on non-composite text columns - which'd be likely to catch the > issue. > > Given that this is a regularly occurring source of corruption for users, > and not one just negligent operators run into (we want people to upgrade > OS versions), I think we ought to factor that into our decision what to > do. > Hi, this is an open item for release 14 . The discussion seems to have gone silent for a couple of weeks. Are we in a position to make any decisions? I hear what Andres says, but is anyone acting on it? cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
cache lookup failed for statistics object 123
Per sqlsmith. postgres=# SELECT pg_get_statisticsobjdef_expressions(123); ERROR: cache lookup failed for statistics object 123 postgres=# \errverbose ERROR: XX000: cache lookup failed for statistics object 123 LOCATION: pg_get_statisticsobjdef_expressions, ruleutils.c:1762 The expectation is that sql callable functions should return null rather than hitting elog(). In the 003 patch, I wonder if this part should be updated, too: | ... which can greatly improve query plans that use the expression index. It can improve queries even that don't use the index, right ? Say, if a query has f(x) = 11, and the MCV list for the expression shows that 50% of the table has f(x)=11, then the query might decide to *not* use an index scan. -- Justin >From 6dec09300e4ad6cc7977acbfee9db7087420a9b5 Mon Sep 17 00:00:00 2001 From: Justin Pryzby Date: Wed, 5 May 2021 04:29:00 -0500 Subject: [PATCH 1/3] Return NULL rather than elog(ERROR) for sql-callable function --- src/backend/utils/adt/ruleutils.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/src/backend/utils/adt/ruleutils.c b/src/backend/utils/adt/ruleutils.c index 0a4fa93d01..881e8ec03d 100644 --- a/src/backend/utils/adt/ruleutils.c +++ b/src/backend/utils/adt/ruleutils.c @@ -1759,9 +1759,9 @@ pg_get_statisticsobjdef_expressions(PG_FUNCTION_ARGS) statexttup = SearchSysCache1(STATEXTOID, ObjectIdGetDatum(statextid)); if (!HeapTupleIsValid(statexttup)) - elog(ERROR, "cache lookup failed for statistics object %u", statextid); + PG_RETURN_NULL(); - /* has the statistics expressions? */ + /* Does the stats object have expressions? */ has_exprs = !heap_attisnull(statexttup, Anum_pg_statistic_ext_stxexprs, NULL); /* no expressions? we're done */ -- 2.17.0 >From ede54e64cf5e2249fe6910d6f2c0de177a0edb9b Mon Sep 17 00:00:00 2001 From: Justin Pryzby Date: Tue, 27 Apr 2021 07:57:50 -0500 Subject: [PATCH 2/3] Comment typos: extended stats a4d75c86b and 518442c7f --- src/backend/parser/parse_utilcmd.c | 2 +- src/backend/statistics/extended_stats.c | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/src/backend/parser/parse_utilcmd.c b/src/backend/parser/parse_utilcmd.c index 9dd30370da..eb9e63f4a8 100644 --- a/src/backend/parser/parse_utilcmd.c +++ b/src/backend/parser/parse_utilcmd.c @@ -1943,7 +1943,7 @@ generateClonedExtStatsStmt(RangeVar *heapRel, Oid heapRelid, * simply append them after simple column references. * * XXX Some places during build/estimation treat expressions as if they - * are before atttibutes, but for the CREATE command that's entirely + * are before attributes, but for the CREATE command that's entirely * irrelevant. */ datum = SysCacheGetAttr(STATEXTOID, ht_stats, diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c index 7e11cb9d5f..5e53783ea6 100644 --- a/src/backend/statistics/extended_stats.c +++ b/src/backend/statistics/extended_stats.c @@ -1796,7 +1796,7 @@ statext_mcv_clauselist_selectivity(PlannerInfo *root, List *clauses, int varReli continue; /* - * Now we know the clause is compatible (we have either atttnums + * Now we know the clause is compatible (we have either attnums * or expressions extracted from it), and was not estimated yet. 
*/ -- 2.17.0 >From e64e4b3d206d76ed8bbd5345798e0d35958759bd Mon Sep 17 00:00:00 2001 From: Justin Pryzby Date: Fri, 23 Apr 2021 09:15:40 -0500 Subject: [PATCH 3/3] Mention statistics objects Previously mentioned at 20210423025012.gi7...@telsasoft.com --- doc/src/sgml/maintenance.sgml | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml index ee6113926a..de7fd75e1c 100644 --- a/doc/src/sgml/maintenance.sgml +++ b/doc/src/sgml/maintenance.sgml @@ -330,7 +330,8 @@ Also, by default there is limited information available about - the selectivity of functions. However, if you create an expression + the selectivity of functions. However, if you create a statistics + object or an expression index that uses a function call, useful statistics will be gathered about the function, which can greatly improve query plans that use the expression index. -- 2.17.0
Re: Bogus collation version recording in recordMultipleDependencies
On Thu, May 6, 2021 at 8:58 AM Andrew Dunstan wrote: > this is an open item for release 14 . The discussion seems to have gone > silent for a couple of weeks. Are we in a position to make any > decisions? I hear what Andres says, but is anyone acting on it? I'm going to revert this and resubmit for 15. That'll give proper time to reconsider the question of whether pg_depend is right for this, and come up with a non-rushed response to the composite type problem etc.
Re: Dubious assertion in RegisterDynamicBackgroundWorker
Robert Haas writes: > On Wed, May 5, 2021 at 3:46 PM Tom Lane wrote: >> Admittedly, it seems unlikely that the difference could exceed >> MAX_PARALLEL_WORKER_LIMIT = 1024 in a regression test run where >> the limit on number of parallel workers is only 8. What I think is >> more likely, given that these counters are unsigned, is that the >> difference was actually negative. Which could be a bug, or it could >> be an expectable race condition, or it could just be some flakiness >> on lorikeet's part (that machine has had a lot of issues lately). > I think that assertion was added by me, and I think the thought > process was that the value shouldn't go negative and that if it does > it's probably a bug which we might want to fix. But since the values > are unsigned I could hardly check for < 0, so I did it this way > instead. > But since there's no memory barrier between the two loads, I guess > there's no guarantee that they have the expected relationship, even if > there is a memory barrier on the store side. I wonder if it's worth > trying to tighten that up so that the assertion is more meaningful, or > just give up and rip it out. I'm afraid that if we do have (or > develop) bugs in this area, someone will discover that the effective > max_parallel_workers value on their system slowly drifts up or down > from the configured value, and we'll have no clue where things are > going wrong. The assertion was intended to give us a chance of > noticing that sort of problem in the buildfarm or on a developer's > machine before the code gets out into the real world. I follow your concern, but I'm not convinced that this assertion is a useful aid; first because the asynchrony involved makes the edge cases rather squishy, and second because allowing 1024 bogus increments before complaining will likely mean that developer test runs will not last long enough to trigger the assertion, and third because if it does fire it's too far removed from the perpetrator to be much help in figuring out what went wrong, or even if anything *is* wrong. I've not tried to trace the code, but I'm now a bit suspicious that there is indeed a design bug here. I gather from the comments that parallel_register_count is incremented by the worker processes, which of course implies that a worker that fails to reattach to shared memory won't do that. But parallel_terminate_count is incremented by the postmaster. If the postmaster will do that even in the case of a worker that failed at startup, then lorikeet's symptoms are neatly explained. I'd be more comfortable with this code if the increments and decrements were handled by the same process. regards, tom lane
Re: Bogus collation version recording in recordMultipleDependencies
On 5/5/21 5:12 PM, Thomas Munro wrote: > On Thu, May 6, 2021 at 8:58 AM Andrew Dunstan wrote: >> this is an open item for release 14 . The discussion seems to have gone >> silent for a couple of weeks. Are we in a position to make any >> decisions? I hear what Andres says, but is anyone acting on it? > I'm going to revert this and resubmit for 15. That'll give proper > time to reconsider the question of whether pg_depend is right for > this, and come up with a non-rushed response to the composite type > problem etc. OK, thanks. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Re: MaxOffsetNumber for Table AMs
On Wed, 5 May 2021 at 22:09, Peter Geoghegan wrote: > > On Wed, May 5, 2021 at 12:43 PM Matthias van de Meent > wrote: > > I believe that it cannot be "just" an additive thing, at least not > > through a normal INCLUDEd column, as you'd get duplicate TIDs in the > > index, with its related problems. You also cannot add it as a key > > column, as this would disable UNIQUE indexes; one of the largest use > > cases of global indexes. So, you must create specialized > > infrastructure for this identifier. > > You're just quibbling about the precise words that I used. Of course > it is true that there must be some sense in which a global index > partition key attribute will need to be special to the implementation > -- how else could a global index enforce uniqueness? That was clearly > implied. This implication was not 100% clear to me, and the last thread on global indexes that implemented it through INCLUDEd columns didn't mention this. As such, I wanted to explicitly mention that this partition/table identifier would need to be part of the keyspace. > > And when we're already adding specialized infrastructure, then this > > should probably be part of a new TID infrastructure. > > This is a non-sequitur. I may have skipped some reasoning: I believe that the TID is the unique identifier of that tuple, within context. For normal indexes, the TID as supplied directly by the TableAM is sufficient, as the context is that table. For global indexes, this TID must include enough information to relate it to the table the tuple originated from. In the whole database, that would be the OID of the table + the TID as supplied by the table. As such, the identifier of the logical row (which can be called the TID), as stored in index tuples in global indexes, would need to consist of the TableAM supplied TID + the (local) id of the table containing the tuple. Assuming we're in agreement on that part, I would think it would be consistent to put this in TID infrastructure, such that all indexes that use such new TID infrastructure can be defined to be global with only minimal effort. > > And if we're going to change TID infrastructure to allow for more > > sizes (as we'd need normal TableAM TIDs, and global index > > partition-identifying TIDs), I'd argue that it should not be too much > > more difficult to create an infrastructure for 'new TID' in which the > > table AM supplies type, size and strict ordering information for these > > 'new TID's. > > > > And if this 'new TID' size is not going to be defined by the index AM > > but by the indexed object (be it a table or a 'global' or whatever > > we'll build indexes on), I see no reason why this 'new TID' > > infrastructure couldn't eventually support variable length TIDs; or > > constant sized usertype TIDs (e.g. the 3 int columns of the primary > > key of a clustered table). > > You're not considering the big picture. It's not self-evident that > anybody will ever have much use for a variable-width TID in their > table AM, at least beyond some fairly simple scheme -- because of the > fundamental issue of TID not working as a stable identifier of logical > rows in Postgres. ZHeap states that it can implement stable TIDs within limits, as IIRC it requires retail index deletion support for all indexes on the updated columns of that table. 
I fail to see why this same infrastructure could not be used for supporting clustered tables, while enforcing these limits only soft enforced in ZHeap (that is, only allowing index AMs that support retail index tuple deletion). > If it was very clear that there would be *some* > significant benefit then the costs might start to look reasonable. But > there isn't. "Build it and they will come" is not at all convincing to > me. Clustered tables / Index-oriented Tables are very useful for tables of which most columns are contained in the PK, or otherwise are often ordered by their PK. I don't know of any way that would allow us to build a clustered table _without_ including the primary key in some form into the TID, or otherwise introducing a layer of indirection that would undo the clustered access implicated by the clustered table. Additionally, compacting/re-clustering a table would be _much_ cheaper for clustered tables, as the indexes attached to that table would not need rebuilding: all TIDs will stay valid across the clustering operation. With regards, Matthias van de Meent
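To make the "TableAM-supplied TID plus table identifier" shape concrete, a purely illustrative fixed-width layout (nothing like this exists in the tree; the struct and field names are invented) could look like:

#include "postgres.h"
#include "storage/itemptr.h"

/*
 * Illustration only: a possible "global TID" for a global index.  Both
 * fields would participate in the index keyspace, so entries stay unique
 * and strictly ordered across all partitions of the indexed object.
 */
typedef struct GlobalIndexTupleId
{
    Oid             indexed_rel;    /* which partition/table the tuple lives in */
    ItemPointerData tid;            /* the TableAM-supplied TID within that table */
} GlobalIndexTupleId;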
Make some column descriptions easier to distinguish visually
Folks, I was writing up a query on pg_constraint, and the columns whose descriptions I've changed here were pretty hard to puzzle out, as they were only distinct up to the difference between F and P, which isn't always easy to see. Please find attached a patch to disambiguate them. Best, David. -- David Fetter http://fetter.org/ Phone: +1 415 235 3778 Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate >From 4202229a9319fa2f307b2761696775a2c5905fcd Mon Sep 17 00:00:00 2001 From: David Fetter Date: Wed, 5 May 2021 15:48:53 -0700 Subject: [PATCH v1] Clarify some column descriptions in pg_constraint To: hackers MIME-Version: 1.0 Content-Type: multipart/mixed; boundary="2.31.1" This is a multi-part message in MIME format. --2.31.1 Content-Type: text/plain; charset=UTF-8; format=fixed Content-Transfer-Encoding: 8bit diff --git doc/src/sgml/catalogs.sgml doc/src/sgml/catalogs.sgml index 492ed348b3..2c501f21f7 100644 --- doc/src/sgml/catalogs.sgml +++ doc/src/sgml/catalogs.sgml @@ -2664,7 +2664,7 @@ SCRAM-SHA-256$:&l (references pg_operator.oid) - If a foreign key, list of the equality operators for PK = FK comparisons + If a foreign key, list of the equality operators for establishing whether primary key columns are equal to the corresponding foreign key columns (PK = FK) @@ -2674,7 +2674,7 @@ SCRAM-SHA-256$ :&l (references pg_operator.oid) - If a foreign key, list of the equality operators for PK = PK comparisons + If a foreign key, list of the equality operators for establishing whether primary key columns are equal (PK = PK) @@ -2684,7 +2684,7 @@ SCRAM-SHA-256$ :&l (references pg_operator.oid) - If a foreign key, list of the equality operators for FK = FK comparisons + If a foreign key, list of the equality operators for establishing whether foreign key columns are equal (FK = FK) --2.31.1--
Re: v14 mechanical code beautification patches
On Wed, May 5, 2021 at 02:18:04PM -0400, Tom Lane wrote: > I notice that we also list this as a pre-beta task in > src/tools/RELEASE_CHANGES: > > * Update inet/cidr data types with newest Bind patches > > However, I can't recall that anyone has ever done any such thing; > and at this point, any attempt to re-sync that code would likely > be a rather major task. Should we take that off the checklist? I think it is related to these files: src/backend/utils/adt/inet_cidr_ntop.c src/backend/utils/adt/inet_net_pton.c which have at the top: * Copyright (c) 2004 by Internet Systems Consortium, Inc. ("ISC") * Copyright (c) 1996,1999 by Internet Software Consortium. but I am not sure we still need to update those, so I would remove it. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Re: v14 mechanical code beautification patches
Bruce Momjian writes: > On Wed, May 5, 2021 at 02:18:04PM -0400, Tom Lane wrote: >> I notice that we also list this as a pre-beta task in >> src/tools/RELEASE_CHANGES: >> * Update inet/cidr data types with newest Bind patches >> However, I can't recall that anyone has ever done any such thing; >> and at this point, any attempt to re-sync that code would likely >> be a rather major task. Should we take that off the checklist? > I think it is related to these files: > src/backend/utils/adt/inet_cidr_ntop.c > src/backend/utils/adt/inet_net_pton.c > which have at the top: > * Copyright (c) 2004 by Internet Systems Consortium, Inc. ("ISC") > * Copyright (c) 1996,1999 by Internet Software Consortium. > but I am not sure we still need to update those, so I would remove it. I dug in the archives and found the thread that prompted you to add that bullet item: https://www.postgresql.org/message-id/200502021700.j12H05j20872%40candle.pha.pa.us which made the point that those were moving targets back in 2005. I doubt they still are, so I don't see much point in keeping this in the checklist. (There may or may not be value in doing a one-time check to see if we've missed anything.) regards, tom lane
Re: MaxOffsetNumber for Table AMs
On Wed, May 5, 2021 at 3:18 PM Matthias van de Meent wrote: > I believe that the TID is the unique identifier of that tuple, within context. > > For normal indexes, the TID as supplied directly by the TableAM is > sufficient, as the context is that table. > For global indexes, this TID must include enough information to relate > it to the table the tuple originated from. Clearly something like a partition identifier column is sometimes just like a regular user-visible column, though occasionally not like one -- whichever is useful to the implementation in each context. For example, we probably want to do predicate pushdown, maybe with real cataloged operators that access the column like any other user-created column (the optimizer knows about the column, which even has a pg_attribute entry). Note that we only ever access the TID column using an insertion scankey today -- so there are several ways in which the partition identifier really would be much more like a user column than tid/scantid ever was. The TID is a key column for most purposes as of Postgres 12 (at least internally). That didn't break all unique indexes due to the existence of non-unique TIDs across duplicates! Insertions that must call _bt_check_unique() can deal with the issue directly, by temporarily unsetting scantid. We can easily do roughly the same thing here: be slightly creative about how we interpret whether or not the partition identifier is "just another key column" across each context. This is also similar to the way the implementation is slightly creative about NULL values, which are not equal to any other value to the user, but are nevertheless just another value from the domain of indexed values to the nbtree implementation. Cleverly defining the semantics of keys to get better performance and to avoid the need for special case code is more or less a standard technique. > In the whole database, that would be the OID of the table + the TID as > supplied by the table. > > As such, the identifier of the logical row (which can be called the > TID), as stored in index tuples in global indexes, would need to > consist of the TableAM supplied TID + the (local) id of the table > containing the tuple. 2 points: 1. Clearly you need to use the partition identifier with the TID in order to look up the version in the table -- you need to use both together in global indexes. But it can still work in much the same way as it would in a standard index -- it's just that you handle that extra detail as well. That's what I meant by additive. 2. If a TID points to a version of a row (or whatever you want to call the generalized version of a HOT chain -- almost the same thing), then of course you can always map it back to the logical row. That must always be true. It is equally true within a global index. Points 1 and 2 above seem obvious to me...so I think we agree on that much. I just don't know how you go from here to "we need variable-width TIDs". In all sincerity, I am confused because to me it just seems as if you're asserting that it must be necessary to have variable width TIDs again and again, without ever getting around to justifying it. Or even trying to. > Assuming we're in agreement on that part, I > would think it would be consistent to put this in TID infrastructure, > such that all indexes that use such new TID infrastructure can be > defined to be global with only minimal effort. Abstract definitions can be very useful, but ultimately they're just tools. They're seldom useful as a starting point in my experience. 
I try to start with the reality on the ground, and perhaps arrive at some kind of abstract model or idea much later. > ZHeap states that it can implement stable TIDs within limits, as IIRC > it requires retail index deletion support for all indexes on the > updated columns of that table. Whether or not that's true is not at all clear. What is true is that the prototype version of zheap that we have as of today is notable in that it more or less allows the moral equivalent of a HOT chain to be arbitrarily long (or much longer, at least). To the best of my knowledge there is nothing about retail index tuple deletion in the design, except perhaps something vague and aspirational. > I fail to see why this same > infrastructure could not be used for supporting clustered tables, > while enforcing these limits only soft enforced in ZHeap (that is, > only allowing index AMs that support retail index tuple deletion). You're ignoring an ocean of complexity here. Principally the need to implement something like two-phase locking (key value locking) in indexes to make this work, but also the need to account for how fundamentally redefining TID breaks things. To say nothing of how this might affect crash recovery. > > If it was very clear that there would be *some* > > significant benefit then the costs might start to look reasonable. But > > there isn't. "Build it and they will come" is not at all convincing to > > m
Re: v14 mechanical code beautification patches
On Wed, May 5, 2021 at 07:08:35PM -0400, Tom Lane wrote: > > I think it is related to these files: > > src/backend/utils/adt/inet_cidr_ntop.c > > src/backend/utils/adt/inet_net_pton.c > > which have at the top: > > * Copyright (c) 2004 by Internet Systems Consortium, Inc. ("ISC") > > * Copyright (c) 1996,1999 by Internet Software Consortium. > > but I am not sure we still need to update those, so I would remove it. > > I dug in the archives and found the thread that prompted you to > add that bullet item: > > https://www.postgresql.org/message-id/200502021700.j12H05j20872%40candle.pha.pa.us > > which made the point that those were moving targets back in 2005. > I doubt they still are, so I don't see much point in keeping this > in the checklist. > > (There may or may not be value in doing a one-time check to see > if we've missed anything.) Thanks. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Re: Small issues with CREATE TABLE COMPRESSION
On Wed, May 05, 2021 at 09:59:41AM -0400, Robert Haas wrote: > TAP tests have a facility for conditionally skipping tests; see > perldoc Test::More. That's actually superior to what you can do with > pg_regress. We'd need to come up with some logic to determine when to > skip or not, though. Perhaps the easiest method would be to have the > relevant Perl script try to create a table with an lz4 column. If that > works, then perform the LZ4-based tests. If it fails, check the error > message. If it says anything that LZ4 is not supported by this build, > skip those tests. If it says anything else, die. There is a simpler and cheaper method to make the execution of TAP test conditional. As in src/test/ssl/t/002_scram.pl for channel binding, I think that you could use something like check_pg_config("#define HAVE_LIBLZ4 1") and use its result to decide which tests to skip or not. -- Michael signature.asc Description: PGP signature
Re: .ready and .done files considered harmful
On Wed, May 5, 2021 at 4:53 PM Stephen Frost wrote: > I do note that this comment is timeline.c is, ahem, perhaps over-stating > things a bit: > > * Note: while this is somewhat heuristic, it does positively guarantee > * that (result + 1) is not a known timeline, and therefore it should > * be safe to assign that ID to a new timeline. OK, that made me laugh out loud. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Dubious assertion in RegisterDynamicBackgroundWorker
On Wed, May 5, 2021 at 5:22 PM Tom Lane wrote: > I follow your concern, but I'm not convinced that this assertion is > a useful aid; first because the asynchrony involved makes the edge > cases rather squishy, and second because allowing 1024 bogus > increments before complaining will likely mean that developer test > runs will not last long enough to trigger the assertion, and third > because if it does fire it's too far removed from the perpetrator > to be much help in figuring out what went wrong, or even if > anything *is* wrong. Well, it allows 1024 bogus increments in one direction, but a negative value will trip it PDQ. > I've not tried to trace the code, but I'm now a bit suspicious > that there is indeed a design bug here. I gather from the > comments that parallel_register_count is incremented by the > worker processes, which of course implies that a worker that > fails to reattach to shared memory won't do that. But > parallel_terminate_count is incremented by the postmaster. > If the postmaster will do that even in the case of a worker that > failed at startup, then lorikeet's symptoms are neatly explained. parallel_register_count is incremented by RegisterDynamicBackgroundWorker, i.e. when the worker process is requested, not after it starts. To try to do it from the worker after it's launched would, as you suppose, be completely busted. The more general point here is that we should be adjusting these at the same point we allocate and free the worker slots. Wherever that happens is the right place to do this, because the goal of the mechanism is precisely to limit the number of such slots that can be used by parallel query. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Replication slot stats misgivings
On Tue, May 4, 2021 at 2:34 PM vignesh C wrote: > > On Tue, May 4, 2021 at 9:48 AM Masahiko Sawada wrote: > > > > On Mon, May 3, 2021 at 10:21 PM Amit Kapila wrote: > > > > > > On Mon, May 3, 2021 at 5:48 PM Masahiko Sawada > > > wrote: > > > > > > > > On Mon, May 3, 2021 at 2:27 PM Amit Kapila > > > > wrote: > > > > > > > > > > On Thu, Apr 29, 2021 at 10:37 AM Amit Kapila > > > > > wrote: > > > > > > > > > > > > On Wed, Apr 28, 2021 at 7:43 PM Masahiko Sawada > > > > > > wrote: > > > > > > > > > > > > > > On Wed, Apr 28, 2021 at 3:25 PM Amit Kapila > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > I am not sure if any of these alternatives are a good idea. > > > > > > > > What do > > > > > > > > you think? Do you have any other ideas for this? > > > > > > > > > > > > > > I've been considering some ideas but don't come up with a good one > > > > > > > yet. It’s just an idea and not tested but how about having > > > > > > > CreateDecodingContext() register before_shmem_exit() callback > > > > > > > with the > > > > > > > decoding context to ensure that we send slot stats even on > > > > > > > interruption. And FreeDecodingContext() cancels the callback. > > > > > > > > > > > > > > > > > > > Is it a good idea to send stats while exiting and rely on the same? > > > > > > I > > > > > > think before_shmem_exit is mostly used for the cleanup purpose so > > > > > > not > > > > > > sure if we can rely on it for this purpose. I think we can't be sure > > > > > > that in all cases we will send all stats, so maybe Vignesh's patch > > > > > > is > > > > > > sufficient to cover the cases where we avoid losing it in cases > > > > > > where > > > > > > we would have sent a large amount of data. > > > > > > > > > > > > > > > > Sawada-San, any thoughts on this point? > > > > > > > > before_shmem_exit is mostly used to the cleanup purpose but how about > > > > on_shmem_exit()? pgstats relies on that to send stats at the > > > > interruption. See pgstat_shutdown_hook(). > > > > > > > > > > Yeah, that is worth trying. Would you like to give it a try? > > > > Yes. > > > > In this approach, I think we will need to have a static pointer in > > logical.c pointing to LogicalDecodingContext that we’re using. At > > StartupDecodingContext(), we set the pointer to the just created > > LogicalDecodingContext and register the callback so that we can refer > > to the LogicalDecodingContext on that callback. And at > > FreeDecodingContext(), we reset the pointer to NULL (however, since > > FreeDecodingContext() is not called when an error happens we would > > need to ensure resetting it somehow). But, after more thought, if we > > have the static pointer in logical.c it would rather be better to have > > a global function that sends slot stats based on the > > LogicalDecodingContext pointed by the static pointer and can be called > > by ReplicationSlotRelease(). That way, we don’t need to worry about > > erroring out cases as well as interruption cases, although we need to > > have a new static pointer. > > > > I've attached a quick-hacked patch. I also incorporated the change > > that calls UpdateDecodingStats() at FreeDecodingContext() so that we > > can send slot stats also in the case where we spilled/streamed changes > > but finished without commit/abort/prepare record. > > > > > I think > > > it still might not cover the cases where we error out in the backend > > > while decoding via APIs because at that time we won't exit, maybe for > > > that we can consider Vignesh's patch. > > > > Agreed. 
It seems to me that the approach of the attached patch is > > better than the approach using on_shmem_exit(). So if we want to avoid > > having the new static pointer and function for this purpose we can > > consider Vignesh’s patch. > > > > I'm ok with using either my patch or Sawada san's patch, Even I had > the same thought of whether we should have a static variable thought > pointed out by Sawada san. Apart from that I had one minor comment: > This comment needs to be corrected "andu sed to sent" > +/* > + * Pointing to the currently-used logical decoding context andu sed to sent > + * slot statistics on releasing slots. > + */ > +static LogicalDecodingContext *MyLogicalDecodingContext = NULL; > + Right, that needs to be fixed. After more thought, I'm concerned that my patch's approach might be invasive for PG14. Given that Vignesh’s patch would cover most cases, I think we can live with a small downside that could miss some slot stats. If we want to ensure sending slot stats at releasing slot, we can develop it as an improvement. My patch would be better to get reviewed by more peoples including the design during PG15 development. Thoughts? Regards, --- Masahiko Sawada EDB: https://www.enterprisedb.com/
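For readers following along, here is a condensed sketch of the static-pointer approach described above (MyLogicalDecodingContext is the name used in the quoted patch fragment; the two function names are invented here, and error handling is omitted):

/* Sketch of the idea only -- not the actual patch. */
static LogicalDecodingContext *MyLogicalDecodingContext = NULL;

static void
sketch_remember_decoding_context(LogicalDecodingContext *ctx)
{
    /* would be called from StartupDecodingContext() */
    MyLogicalDecodingContext = ctx;
}

void
sketch_send_slot_stats_on_release(void)
{
    /* would be called from ReplicationSlotRelease(), covering error paths too */
    if (MyLogicalDecodingContext != NULL)
    {
        UpdateDecodingStats(MyLogicalDecodingContext);
        MyLogicalDecodingContext = NULL;
    }
}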
Re: MaxOffsetNumber for Table AMs
On Wed, May 5, 2021 at 3:43 PM Matthias van de Meent wrote: > I believe that it cannot be "just" an additive thing, at least not > through a normal INCLUDEd column, as you'd get duplicate TIDs in the > index, with its related problems. You also cannot add it as a key > column, as this would disable UNIQUE indexes; one of the largest use > cases of global indexes. So, you must create specialized > infrastructure for this identifier. > > And when we're already adding specialized infrastructure, then this > should probably be part of a new TID infrastructure. > > And if we're going to change TID infrastructure to allow for more > sizes (as we'd need normal TableAM TIDs, and global index > partition-identifying TIDs), I'd argue that it should not be too much > more difficult to create an infrastructure for 'new TID' in which the > table AM supplies type, size and strict ordering information for these > 'new TID's. > > And if this 'new TID' size is not going to be defined by the index AM > but by the indexed object (be it a table or a 'global' or whatever > we'll build indexes on), I see no reason why this 'new TID' > infrastructure couldn't eventually support variable length TIDs; or > constant sized usertype TIDs (e.g. the 3 int columns of the primary > key of a clustered table). > > The only requirements that I believe to be fundamental for any kind of TID are > > 1.) Uniqueness during the lifecycle of the tuple, from creation to > life to dead to fully dereferenced from all indexes; > 2.) There exists a strict ordering of all TIDs of that type; > > And maybe to supply some form of efficiency to the underlying tableAM: > > 3.) There should be an equivalent of bitmap for that TID type. > > For the nbtree deduplication subsystem, and for gin posting lists to > be able to work efficiently, the following must also hold: > > 4.) The TID type has a fixed size, preferably efficiently packable. > > Only the last requirement cannot be met with varlena TID types. But, > as I also believe that not all indexes can be expected to work (well) > for all kinds of TableAM, I don't see how this would be a blocking > issue. +1 to all of that. > Storage gains for index-oriented tables can become as large as the > size of the primary key by not having to store all primary key values > in both the index and the table; which can thus be around 100% of a > table in the least efficient cases of having a PK over all columns. > > Yes, this might be indeed only a 'small gain' for access latency, but > not needing to store another copy of your data (and keeping it in > cache, etc.) is a significant win in my book. This is a really good point. Also, if the table is ordered by a synthetic logical TID, range scans on the primary key will be less efficient than if the primary key is itself the TID. We have the ability to CLUSTER on an index for good reasons, and "Automatically maintain clustering on a table" has been on the todo list forever. It's hard to imagine this will ever be achieved with the current heap, though: the way to get there is to have a table AM for which this is an explicit goal. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Dubious assertion in RegisterDynamicBackgroundWorker
I wrote: > I've not tried to trace the code, but I'm now a bit suspicious > that there is indeed a design bug here. I gather from the > comments that parallel_register_count is incremented by the > worker processes, which of course implies that a worker that > fails to reattach to shared memory won't do that. But > parallel_terminate_count is incremented by the postmaster. > If the postmaster will do that even in the case of a worker that > failed at startup, then lorikeet's symptoms are neatly explained. That theory seems to be nonsense. After a bit more study of the code, I see that parallel_register_count is incremented by the *leader* process, when it reserves a BackgroundWorkerSlot for the worker. And parallel_terminate_count is incremented by the postmaster when it releases the slot; so it's darn hard to see how parallel_terminate_count could get ahead of parallel_register_count. I noticed that lorikeet's worker didn't fail at shared memory reattach, as I first thought, anyway. It failed at ERROR: could not map dynamic shared memory segment which means we ought to be able to reproduce the symptoms by faking failure of dsm_attach(), as I did in the quick hack attached. What I get is a lot of "parallel worker failed to initialize" and "lost connection to parallel worker" errors, but no assertion. (I also tried this with an EXEC_BACKEND build, just in case that'd change the behavior, but it didn't.) So it seems like the "lorikeet is flaky" theory is looking pretty plausible. I do see what seems to be a bug-let in ForgetBackgroundWorker. BackgroundWorkerStateChange is careful to do this when freeing a slot: /* * We need a memory barrier here to make sure that the load of * bgw_notify_pid and the update of parallel_terminate_count * complete before the store to in_use. */ notify_pid = slot->worker.bgw_notify_pid; if ((slot->worker.bgw_flags & BGWORKER_CLASS_PARALLEL) != 0) BackgroundWorkerData->parallel_terminate_count++; pg_memory_barrier(); slot->pid = 0; slot->in_use = false; but the mainline case in ForgetBackgroundWorker is a lot less paranoid: Assert(rw->rw_shmem_slot < max_worker_processes); slot = &BackgroundWorkerData->slot[rw->rw_shmem_slot]; if ((rw->rw_worker.bgw_flags & BGWORKER_CLASS_PARALLEL) != 0) BackgroundWorkerData->parallel_terminate_count++; slot->in_use = false; One of these functions is mistaken. However, I can't construct a theory whereby that explains lorikeet's symptoms, mainly because Intel chips don't do out-of-order stores so the messing with parallel_terminate_count should be done before in_use is cleared, even without an explicit memory barrier. regards, tom lane diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c index 3550ef13ba..d3b00c2f9e 100644 --- a/src/backend/access/transam/parallel.c +++ b/src/backend/access/transam/parallel.c @@ -1278,6 +1278,11 @@ ParallelWorkerMain(Datum main_arg) "Parallel worker", ALLOCSET_DEFAULT_SIZES); + if (random() < INT_MAX / 64) + ereport(ERROR, +(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), + errmsg("fake failure to map dynamic shared memory segment"))); + /* * Attach to the dynamic shared memory segment for the parallel query, and * find its table of contents.
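As a footnote to the ForgetBackgroundWorker() inconsistency noted above, making it mirror BackgroundWorkerStateChange() would look roughly like this (untested sketch, shown only to illustrate the ordering):

/* Sketch: free the slot with the same barrier discipline as BackgroundWorkerStateChange(). */
Assert(rw->rw_shmem_slot < max_worker_processes);
slot = &BackgroundWorkerData->slot[rw->rw_shmem_slot];
if ((rw->rw_worker.bgw_flags & BGWORKER_CLASS_PARALLEL) != 0)
    BackgroundWorkerData->parallel_terminate_count++;
/* make the counter update visible before the slot is marked free */
pg_memory_barrier();
slot->in_use = false;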
Re: PG in container w/ pid namespace is init, process exits cause restart
On Tue, May 4, 2021 at 4:35 PM Tom Lane wrote: > I'm still thinking that we're best off refusing to do > that and making people install one of these shims that's meant > for the job. I have to admit that I care less about the specific issue here than about the general issue of being open to hearing what the user needs actually are. I honestly have no idea whether it's sensible to want to run postgres as init. If people who know about container stuff say that's a dumb idea and you shouldn't do it, then IMHO your conclusion that we should simply disallow it is 100% correct. But if those people show up and say, no, it's actually super-convenient for postgres to run as init and using one of those shim things has significant downsides that are hard to mitigate, and if further we could do what they say they need with just a little bit of extra code, then IMHO your conclusion is 100% wrong. Now so far as I can see right now neither conclusion is crystal clear - opinions seem to be a bit mixed. So right now I don't really know what to think. I just don't want to fall into the trap of thinking that core developers are somehow in a better place to know the right answer than users. -- Robert Haas EDB: http://www.enterprisedb.com
Re: MaxOffsetNumber for Table AMs
How hard would it be to declare TID as current ItemPointerData with some values prohibited (NULL, SpecTokenOffsetNumber = 0xfffe, MovedPartitionsOffsetNumber = 0xfffd, presumably also 0x ?). And then commit to fixing usage outside access/heap/ which depend on small value for MaxHeapTuplesPerPage, currently only nbodes/tidscan.c , access/gin/ginpostinglist.c and access/brin/brin_*.c there is also MaxHeapTuplesPerPage usage in ./backend/storage/page/bufpage.c but it seems to be all in heapam-dependent functions (PageRepairFragmentation(), PageGetHeapFreeSpace() and few others) which most likely should be moved to access/heap/ Doing it this way would leave us with some manageable complexity in mapping from TID to 48-bit integer and/or 3 wanted positions in 2^32 Hannu Krosing On Wed, May 5, 2021 at 8:40 PM Peter Geoghegan wrote: > > On Wed, May 5, 2021 at 11:25 AM Andres Freund wrote: > > Agreed. And we can increase the fit a good bit without needing invasive > > all-over changes. With those changes often even helping heap. > > > > E.g. tidbitmap.c's harcoded use of MaxHeapTuplesPerPage is a problem > > even for heap - we waste a lot of space that's not commonly used. A > > better datastructure (radix tree like I'd say, but several tree shaped > > approaches seem possible). > > Agreed -- even if we only cared about heapam we still ought to do > something about tidbitmap.c's use of MaxHeapTuplesPerPage. > > -- > Peter Geoghegan > >
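For context on the tidbitmap.c point quoted above: each per-page entry in a TID bitmap sizes its word array from MaxHeapTuplesPerPage, so every exact page pays for the maximum possible number of heap line pointers whether or not they are used. Condensed from tidbitmap.c (not verbatim):

#define MAX_TUPLES_PER_PAGE  MaxHeapTuplesPerPage
#define WORDS_PER_PAGE  ((MAX_TUPLES_PER_PAGE - 1) / BITS_PER_BITMAPWORD + 1)

typedef struct PagetableEntry
{
    BlockNumber blockno;        /* page number (hashtable key) */
    char        status;         /* hash entry status */
    bool        ischunk;        /* T = lossy storage, F = exact */
    bool        recheck;        /* should the tuples be rechecked? */
    bitmapword  words[WORDS_PER_PAGE];  /* one bit per possible line pointer on the page */
} PagetableEntry;

(The real declaration sizes the array as Max(WORDS_PER_PAGE, WORDS_PER_CHUNK) so that an entry can also serve as a lossy chunk.)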
Re: PG in container w/ pid namespace is init, process exits cause restart
Robert Haas writes: > I have to admit that I care less about the specific issue here than > about the general issue of being open to hearing what the user needs > actually are. I honestly have no idea whether it's sensible to want to > run postgres as init. If people who know about container stuff say > that's a dumb idea and you shouldn't do it, then IMHO your conclusion > that we should simply disallow it is 100% correct. But if those people > show up and say, no, it's actually super-convenient for postgres to > run as init and using one of those shim things has significant > downsides that are hard to mitigate, and if further we could do what > they say they need with just a little bit of extra code, then IMHO > your conclusion is 100% wrong. Now so far as I can see right now > neither conclusion is crystal clear - opinions seem to be a bit mixed. > So right now I don't really know what to think. I just don't want to > fall into the trap of thinking that core developers are somehow in a > better place to know the right answer than users. I don't claim to have an opinion about how convenient it would be for users to not need an init shim. I do claim to have a qualified opinion about how hard it would be for us to support the case. It'd hobble our ability to detect child-process bookkeeping errors, and it'd put constraints on how we manage the postmaster's signal handling. Maybe those constraints will never matter, but that's a contract I don't really want to buy into for this seemingly-not-large benefit. regards, tom lane
Re: MaxOffsetNumber for Table AMs
On Thu, May 6, 2021 at 3:07 AM Robert Haas wrote: > > On Wed, May 5, 2021 at 3:43 PM Matthias van de Meent > > Storage gains for index-oriented tables can become as large as the > > size of the primary key by not having to store all primary key values > > in both the index and the table; which can thus be around 100% of a > > table in the least efficient cases of having a PK over all columns. > > > > Yes, this might be indeed only a 'small gain' for access latency, but > > not needing to store another copy of your data (and keeping it in > > cache, etc.) is a significant win in my book. > > This is a really good point. Also, if the table is ordered by a > synthetic logical TID, range scans on the primary key will be less > efficient than if the primary key is itself the TID. We have the > ability to CLUSTER on an index for good reasons, and "Automatically > maintain clustering on a table" has been on the todo list forever. > It's hard to imagine this will ever be achieved with the current heap, > though: the way to get there is to have a table AM for which this is > an explicit goal. But would this not have the downside that all the secondary indexes will blow up as they now need to have the full table row as the TID ? - Hannu Krosing
Re: Printing backtrace of postgres processes
Here's a cleaned-up copy of the doc text. Send a request to the backend with the specified process ID to log its backtrace. The backtrace will be logged at message level LOG. It will appear in the server log based on the log configuration set (see the logging configuration settings for more information), but will not be sent to the client regardless of client_min_messages. A backtrace will identify where exactly the backend process is currently executing. This may be useful to developers to diagnose stuck processes and other problems. This feature is not supported for the postmaster, logger, or statistics collector process. This feature will be available if PostgreSQL was built with the ability to capture backtraces. If not available, the function will return false and show a WARNING. Only superusers can request backends to log their backtrace. > - * this and related functions are not inlined. > + * this and related functions are not inlined. If edata pointer is valid > + * backtrace information will set in edata. will *be* set -- Justin
Re: pg_receivewal makes a bad daemon
On Wed, May 05, 2021 at 01:12:03PM -0400, Robert Haas wrote: > On Wed, May 5, 2021 at 12:34 PM Magnus Hagander wrote: > > Is this really a problem we should fix ourselves? Most daemon-managers > > today will happily be configured to automatically restart a daemon on > > failure with a single setting since a long time now. E.g. in systemd > > (which most linuxen uses now) you just set Restart=on-failure (or > > maybe even Restart=always) and something like RestartSec=10. > > > > That said, it wouldn't cover an fsync() error -- they will always > > restart. The way to handle that is for the operator to capture the > > error message perhaps, and just "deal with it"? > > Maybe, but if that's really a non-problem, why does postgres itself > restart, and have facilities to write and rotate log files? I feel > like this argument boils down to "a manual transmission ought to be > good enough for anyone, let's not have automatics." But over the years > people have found that automatics are a lot easier to drive. It may be > true that if you know just how to configure your system's daemon > manager, you can make all of this work, but it's not like we document > how to do any of that, and it's probably not the same on every > platform - Windows in particular - and, really, why should people have > to do this much work? If I want to run postgres in the background I > can just type 'pg_ctl start'. I could even put 'pg_ctl start' in my > crontab to make sure it gets restarted within a few minutes even if > the postmaster dies. If I want to keep pg_receivewal running all the > time ... I need a whole pile of extra mechanism to work around its > inherent fragility. Documenting how that's typically done on modern > systems, as you propose further on, would be great, but I can't do it, > because I don't know how to make it work. Hence the thread. > > > Also, all the above also apply to pg_recvlogical, right? So if we do > > want to invent our own daemon-init-system, we should probably do one > > more generic that can handle both. > > Yeah. And I'm not really 100% convinced that trying to patch this > functionality into pg_receive{wal,logical} is the best way forward ... > but I'm not entirely convinced that it isn't, either. I think one of > the basic problems with trying to deploy PostgreSQL in 2021 is that it > needs so much supporting infrastructure and so much babysitting. > archive_command has to be a complicated, almost magical program we > don't provide, and we don't even tell you in the documentation that > you need it. If you don't want to use that, you can stream with > pg_receivewal instead, but now you need a complicated daemon-runner > mechanism that we don't provide or document the need for. You also > probably need a connection pooler that we don't provide, a failover > manager that we don't provide, and backup management software that we > don't provide. And the interfaces that those tools have to work with > are so awkward and primitive that even the tool authors can't always > get it right. So I'm sort of unimpressed by any arguments that boil > down to "what we have is good enough" or "that's the job of some other > piece of software". Too many things are the job of some piece of > software that doesn't really exist, or is only available on certain > platforms, or that has some other problem that makes it not usable for > everyone. People want to be able to download and use PostgreSQL > without needing a whole library of other bits and pieces from around > the Internet. 
We do use at least one bit and piece from around the internet to make our software usable, namely libreadline, the absence of which makes psql pretty much unusable. That out of the way, am I understanding correctly that you're proposing that we make tools for daemonizing, logging, connection management, and failover, and ship them with PostgreSQL? I can see the appeal for people shipping proprietary forks of PostgreSQL, especially ones under restrictive licenses, and I guess we could make a pretty good case for continuing to center those interests as we have since the Berkeley days. Rather than, or maybe as a successor to, wiring such things into each tool we ship that requires them, I'd picture something along the lines of shared libraries (.sos) provided with the distribution as we do now, which could then be repurposed, modified, etc. Another possibility would be to look around for mature capabilities that are cross-platform in the sense that they work on all the platforms we do. While I don't think it's likely we'd find them for all the above use cases under compatible licenses, it's probably worth a look. At worst, we'd get some idea of how (not) to design the APIs to them. I'm going to guess that anything with an incompatible license will upset people who are accustomed to ensuring that we have what legally amounts to an MIT-license-clean distribution, but I'm thinking that option is at least worth discussing, even if the immediate conse
Re: v14 mechanical code beautification patches
I wrote: > I dug in the archives and found the thread that prompted you to > add that bullet item: > https://www.postgresql.org/message-id/200502021700.j12H05j20872%40candle.pha.pa.us > which made the point that those were moving targets back in 2005. > I doubt they still are, so I don't see much point in keeping this > in the checklist. > (There may or may not be value in doing a one-time check to see > if we've missed anything.) I located the "current" versions of those files in libbind 6.0. (I put "current" in quotes because the file dates seem to be 2005-2008, so indeed development came to a stop a long time ago.) They are *very* different from what we have, though. Some of it is visibly cosmetic, but other parts have been rewritten quite a bit, so it's hard to tell if the functionality is identical. In the absence of a reason to think we have bugs that we need to fix, I'm not sure it's worth analyzing the differences in detail. I definitely wouldn't just adopt all the diffs blindly. In any case, that RELEASE_CHANGES item is clearly a dead letter now, so I'll go remove it. regards, tom lane
Re: MaxOffsetNumber for Table AMs
On Thu, 2021-05-06 at 03:26 +0200, Hannu Krosing wrote: > How hard would it be to declare TID as current ItemPointerData with > some values prohibited (NULL, SpecTokenOffsetNumber = 0xfffe, > MovedPartitionsOffsetNumber = 0xfffd, presumably also 0x ?). I don't think there's consensus in this thread that we want to do that, but I'd be fine with it. It's possible but not trivial. tidbitmap.c would be the biggest challenge, I think. Regards, Jeff Davis
RE: [bug?] Missed parallel safety checks, and wrong parallel safety
From: Robert Haas > On Tue, May 4, 2021 at 11:47 PM Greg Nancarrow > wrote: > > Problem is, for built-in functions, the changes are allowed, but for > > some properties (like strict) the allowed changes don't actually take > > effect (this is what Amit was referring to - so why allow those > > changes?). > > It's because some of the function properties are cached in > > FmgrBuiltins[] (for a "fast-path" lookup for built-ins), according to > > their state at build time (from pg_proc.dat), but ALTER FUNCTION is > > just changing it in the system catalogs. Also, with sufficient > > privileges, a built-in function can be redefined, yet the original > > function (whose info is cached in FmgrBuiltins[]) is always invoked, > > not the newly-defined version. > > I agree. I think that's not ideal. I think we should consider putting > some more restrictions on updating system catalog changes, and I also > think that if we can get out of having strict need to be part of > FmgrBuiltins[] that would be good. But what I don't agree with is the > idea that since strict already has this problem, it's OK to do the > same thing with parallel-safety. That seems to me to be making a bad > situation worse, and I can't see what problem it actually solves. Let me divide the points: (1) Is it better to get hardcoded function properties out of fmgr_builtins[]? There's little value in doing so, or even in thinking about it. Users have no business changing system objects, in this case system functions. Also, hardcoding is a worthwhile strategy for good performance or for other unavoidable reasons. Postgres already uses it, as in the system catalog relcache below. [relcache.c] /* * hardcoded tuple descriptors, contents generated by genbki.pl */ static const FormData_pg_attribute Desc_pg_class[Natts_pg_class] = {Schema_pg_class}; static const FormData_pg_attribute Desc_pg_attribute[Natts_pg_attribute] = {Schema_pg_attribute}; ... (2) Should users be disallowed from changing system function properties with ALTER FUNCTION? Maybe yes, but it's not an important issue for achieving parallel INSERT SELECT at the moment, so I think it can be discussed in an independent, separate thread. As a reminder, Postgres has safeguards against modifying system objects, as follows. test=# drop table^C test=# drop function pg_wal_replay_pause(); ERROR: cannot drop function pg_wal_replay_pause() because it is required by the database system test=# drop table pg_largeobject; ERROR: permission denied: "pg_largeobject" is a system catalog OTOH, Postgres doesn't disallow changing system table column values directly, such as UPDATE pg_proc SET ..., but the manual warns that such operations are dangerous, so we don't have to care about that case. Chapter 52. System Catalogs https://www.postgresql.org/docs/devel/catalogs.html "You can drop and recreate the tables, add columns, insert and update values, and severely mess up your system that way. Normally, one should not change the system catalogs by hand, there are normally SQL commands to do that. (For example, CREATE DATABASE inserts a row into the pg_database catalog — and actually creates the database on disk.) There are some exceptions for particularly esoteric operations, but many of those have been made available as SQL commands over time, and so the need for direct manipulation of the system catalogs is ever decreasing." (3) Why do we want to have parallel-safety in fmgr_builtins[]?
As proposed in this thread and/or "Parallel INSERT SELECT take 2", we thought of detecting parallel-unsafe function execution during SQL statement execution, instead of imposing much overhead by checking parallel safety during query planning. Specifically, we add a parallel safety check in fmgr_info() and/or FunctionCallInvoke(). (Alternatively, I think we can conclude that we simply assume parallel-unsafe built-in functions won't be used in parallel DML. In that case, we don't change FmgrBuiltin and just skip the parallel safety check for built-in functions when the function is called. Would you buy this?) Regards Takayuki Tsunakawa
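To make the fmgr_info()/FunctionCallInvoke() idea above concrete, here is a minimal sketch of what such a runtime check could look like. It is not taken from any posted patch: the parallel_safe field added to the FmgrBuiltin entry and the error wording are assumptions made purely for illustration.

/* fmgrtab.h-style entry, with a hypothetical parallel_safe flag added */
typedef struct FmgrBuiltin
{
    Oid         foid;           /* OID of the function */
    short       nargs;          /* number of arguments */
    bool        strict;         /* T if function is "strict" */
    bool        retset;         /* T if function returns a set */
    bool        parallel_safe;  /* assumed new flag for this sketch */
    const char *funcName;       /* C name of the function */
    PGFunction  func;           /* pointer to compiled function */
} FmgrBuiltin;

/* Sketch of a check that fmgr_info() or FunctionCallInvoke() could make */
static void
check_builtin_parallel_safety(const FmgrBuiltin *fbp)
{
    /* A parallel-unsafe function must not run while in parallel mode */
    if (IsInParallelMode() && !fbp->parallel_safe)
        ereport(ERROR,
                (errcode(ERRCODE_INVALID_TRANSACTION_STATE),
                 errmsg("cannot execute parallel-unsafe built-in function \"%s\" during a parallel operation",
                        fbp->funcName)));
}

The trade-off being debated is exactly whether carrying such a flag in the hardcoded table (and keeping it consistent with the catalogs) is worth it, versus skipping the check for built-ins altogether.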
Re: decoupling table and index vacuum
On Fri, Apr 23, 2021 at 5:01 AM Andres Freund wrote: > > Hi, > > On 2021-04-22 12:15:27 -0400, Robert Haas wrote: > > On Wed, Apr 21, 2021 at 5:38 PM Andres Freund wrote: > > > I'm not sure that's the only way to deal with this. While some form of > > > generic "conveyor belt" infrastructure would be a useful building block, > > > and it'd be sensible to use it here if it existed, it seems feasible to > > > store dead tids in a different way here. You could e.g. have per-heap-vacuum > > > files with a header containing LSNs that indicate the age of the > > > contents. > > > > That's true, but I have some reservations about being overly reliant on > > the filesystem to provide structure here. There are good reasons to be > > worried about bloating the number of files in the data directory. Hmm, > > but maybe we could mitigate that. First, we could skip this for small > > relations. If you can vacuum the table and all of its indexes using > > the naive algorithm in <10 seconds, you probably shouldn't do anything > > fancy. That would *greatly* reduce the number of additional files > > generated. Second, we could forget about treating them as separate > > relation forks and make them some other kind of thing entirely, in a > > separate directory > > I'm not *too* worried about this issue. IMO the big difference to the > cost of additional relation forks is that such files would only exist > when the table is modified to a somewhat meaningful degree. IME the > practical issues with the number of files due to forks are cases where > huge numbers of tables that are practically never modified exist. > > That's not to say that I am sure that some form of "conveyor belt" > storage *wouldn't* be the right thing. How were you thinking of dealing > with the per-relation aspects of this? One conveyor belt per relation? > > > > especially if we adopted Sawada-san's proposal to skip WAL logging. I > > don't know if that proposal is actually a good idea, because it > > effectively adds a performance penalty when you crash or fail over, > > and that sort of thing can be an unpleasant surprise. But it's > > something to think about. > > I'm doubtful about skipping WAL logging entirely - I'd have to think > harder about it, but I think that'd mean we'd restart from scratch after > crashes / immediate restarts as well, because we couldn't rely on the > contents of the "dead tid" files to be accurate. In addition to the > replication issues you mention. Yeah, not having WAL would have a big negative impact on various other aspects. Can we piggyback the WAL for the TID fork on XLOG_HEAP2_PRUNE? That is, we add the buffer for the TID fork to XLOG_HEAP2_PRUNE and record the 64-bit number of the first dead TID in the list, so that we can add dead TIDs to the TID fork while replaying XLOG_HEAP2_PRUNE. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
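As a rough sketch of the "per-heap-vacuum files with a header containing LSNs" idea quoted above, something like the following struct captures the pieces being discussed. The name and layout are invented for illustration; nothing like this exists in the tree or in a posted patch.

/*
 * Hypothetical header for a file (or fork segment) that accumulates dead
 * TIDs collected by heap vacuuming, to be consumed later by index vacuuming.
 */
typedef struct DeadTidFileHeader
{
    uint32      magic;          /* format identifier */
    uint32      version;        /* layout version */
    XLogRecPtr  start_lsn;      /* WAL position when collection began ("age" of contents) */
    XLogRecPtr  end_lsn;        /* WAL position of the most recently appended batch */
    uint64      num_tids;       /* number of ItemPointerData entries that follow */
} DeadTidFileHeader;

The LSNs are what would let vacuum (or crash recovery) judge whether the accumulated dead TIDs can still be trusted, which is the crux of the WAL-logging question Sawada-san raises at the end.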
Re: MaxOffsetNumber for Table AMs
- Hannu Krosing On Thu, May 6, 2021 at 4:53 AM Jeff Davis wrote: > > On Thu, 2021-05-06 at 03:26 +0200, Hannu Krosing wrote: > > How hard would it be to declare TID as current ItemPointerData with > > some values prohibited (NULL, SpecTokenOffsetNumber = 0xfffe, > > MovedPartitionsOffsetNumber = 0xfffd, presumably also 0x ?). > > I don't think there's consensus in this thread that we want to do that, > but I'd be fine with it. Sure. I just proposed this as a Minimal Viable Change. Just hoping that we can agree on an interim solution which addresses the immediate problem and then continue arguing about the ideal way for the rest of the v15 cycle (and v16 and v17 ;) ) > > It's possible but not trivial. tidbitmap.c would be the biggest > challenge, I think. I think all these places (tidbitmap, gin, brin) rely on the "relatively small" MaxHeapTuplesPerPage as an upper bound for some allocations, and then still allocate a lot more than needed. One can get to 291 tids / page only when you create a table with no columns, or with fewer than 8 columns which are all set to NULL. A table with a single non-null boolean column can already fit only 226 tuples per page. It is definitely a non-trivial amount of work to rewrite these three, but going to (almost) the full 48 bits from the current theoretically-a-little-over-40-effective-bits would expand the number of addressable tuples 225 times. Of course it would be extra nice to also somehow encode the 3 special ItemPointerData values (NULL, 0xfffe, 0xfffd) "somewhere else" and get an option of uninterrupted 48-bit address space for non-heap table AMs, but this would likely be much more disruptive, if possible at all. We could still check if they are used outside of heapam, and maybe just fix those places. > > Regards, > Jeff Davis > > >
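To spell out the bit arithmetic above: ItemPointerData is 48 bits (a 32-bit block number plus a 16-bit offset number), and heap-related code reserves a handful of offset values (InvalidOffsetNumber = 0, MovedPartitionsOffsetNumber = 0xfffd, SpecTokenOffsetNumber = 0xfffe, and presumably 0xffff as well, per the quoted question). A table AM that wants a dense row-id space while avoiding those values could map row ids as in the sketch below; this is purely illustrative and not from any patch.

/*
 * Illustrative sketch: encode a 64-bit row number into an ItemPointerData
 * while avoiding offset values that heap-related code treats specially.
 * Offsets 1..0xfffc are used, giving 0xfffc usable slots per block number.
 * Callers must keep rowid below (2^32 * 0xfffc) so the block number fits.
 */
#define USABLE_OFFSETS_PER_BLOCK 0xfffc

static void
rowid_to_tid(uint64 rowid, ItemPointer tid)
{
    BlockNumber blkno = (BlockNumber) (rowid / USABLE_OFFSETS_PER_BLOCK);
    OffsetNumber off = (OffsetNumber) (rowid % USABLE_OFFSETS_PER_BLOCK) + 1;

    ItemPointerSet(tid, blkno, off);
}

static uint64
tid_to_rowid(ItemPointer tid)
{
    return (uint64) ItemPointerGetBlockNumber(tid) * USABLE_OFFSETS_PER_BLOCK +
        (ItemPointerGetOffsetNumber(tid) - 1);
}

The "225 times" figure above comes from comparing this kind of nearly-full 48-bit space (roughly 2^32 * 0xfffc addressable tuples) against the roughly 2^32 * MaxHeapTuplesPerPage (291) combinations that heap-style TIDs effectively allow today.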
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Fri, Apr 30, 2021 at 7:43 PM Dilip Kumar wrote: > > So, I > > used the other approach which led to the attached. > > The patch looks fine to me. > Thanks, pushed! -- With Regards, Amit Kapila.
Re: Identify missing publications from publisher while create/alter subscription.
On Tue, 04 May 2021 at 21:20, vignesh C wrote: > On Tue, May 4, 2021 at 2:37 PM Bharath Rupireddy > wrote: >> >> On Mon, May 3, 2021 at 7:59 PM vignesh C wrote: >> > Thanks for the comments, these comments are handle in the v7 patch >> > posted in my earlier mail. >> >> Thanks. Some comments on v7 patch: >> >> 1) How about "Add publication names from the list to a string." >> instead of >> * Append the list of publication to dest string. >> > > Modified. > >> 2) How about "Connect to the publisher and see if the given >> publication(s) is(are) present." >> instead of >> * Connect to the publisher and check if the publication(s) exist. >> > > Modified. > >> 3) Below comments are unnecessary as the functions/code following them >> will tell what the code does. >> /* Verify specified publication(s) exist in the publisher. */ >> /* We are done with the remote side, close connection. */ >> >> /* Verify specified publication(s) exist in the publisher. */ >> PG_TRY(); >> { >> check_publications(wrconn, publications, true); >> } >> PG_FINALLY(); >> { >> /* We are done with the remote side, close connection. */ >> walrcv_disconnect(wrconn); >> } >> > > Modified. > >> 4) And also the comment below that's there before check_publications >> is unnecessary, as the function name and description would say it all. >> /* Verify specified publication(s) exist in the publisher. */ >> > > Modified. > >> 5) A typo - it is "do not exist" >> # Multiple publications does not exist. >> > > Modified. > >> 6) Should we use "m" specified in all the test cases something like we >> do for $stderr =~ m/threads are not supported on this platform/ or >> m/replication slot "test_slot" was not created in this database/? >> $stderr =~ >> /ERROR: publication "non_existent_pub" does not exist in the >> publisher/, > > Modified. > > Thanks for the comments, Attached patch has the fixes for the same. Thanks for updating the patch. Some comments on the v8 patch. 1) How about using appendStringInfoChar() for the first and last calls, since it is faster? + appendStringInfoString(dest, "\""); + appendStringInfoString(dest, pubname); + appendStringInfoString(dest, "\""); 2) How about using if (!validate_publication), to keep the code style consistent? + if (validate_publication == false) + return; 3) Should we free the memory when we finish check_publications()? + publicationsCopy = list_copy(publications); 4) It would be better to wrap the word "streaming" in quotes. Also, should we add a 'no "validate_publication"' comment for the validate_publication parameters? NULL, NULL, /* no "binary" */ - NULL, NULL); /* no streaming */ + NULL, NULL, /* no streaming */ + NULL, NULL); -- Regards, Japin Li. ChengDu WenWu Information Technology Co.,Ltd.
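For clarity, the changes suggested in points 1) and 2) above would look roughly like this (a sketch against the quoted hunks, so the variable names come from the patch under review):

/* 1) Use appendStringInfoChar() for the surrounding quote characters */
appendStringInfoChar(dest, '"');
appendStringInfoString(dest, pubname);
appendStringInfoChar(dest, '"');

/* 2) Test the boolean directly instead of comparing against false */
if (!validate_publication)
    return;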
Re: MaxOffsetNumber for Table AMs
On Wed, May 5, 2021 at 10:53 PM Jeff Davis wrote: > On Thu, 2021-05-06 at 03:26 +0200, Hannu Krosing wrote: > > How hard would it be to declare TID as current ItemPointerData with > > some values prohibited (NULL, SpecTokenOffsetNumber = 0xfffe, > > MovedPartitionsOffsetNumber = 0xfffd, presumably also 0x ?). > > I don't think there's consensus in this thread that we want to do that, > but I'd be fine with it. > > It's possible but not trivial. tidbitmap.c would be the biggest > challenge, I think. I think that would be fine, too. I don't think it's the ideal situation, but it seems like a clear improvement over what we have now. We might want to reserve a few values for future projects that might need distinguished values like SpecTokenOffsetNumber or MovedPartitionsOffsetNumber, though, so we don't completely box ourselves into a corner. -- Robert Haas EDB: http://www.enterprisedb.com