Re: Two patches to speed up pg_rewind.
Refactored the code a bit along with fixes. Manually tested them on CentOS & Ubuntu (the latter has copy_file_range()). For the first patch, I actually have a concern. My assumption is that the target pg_data directory should already be fsync-ed. This should normally be correct, but there is one scenario where it is not: a cleanly shut-down database's pgdata directory was copied to another directory, in which case the new pgdata is not fsync-ed. I'm not sure whether that happens in real production environments or not, but even considering it, we could still use the optimization for the case that calls ensureCleanShutdown(), since that ensures a pgdata fsync on the target.

v2-0001-Fsync-the-affected-files-directories-only-in-pg_r.patch
Description: v2-0001-Fsync-the-affected-files-directories-only-in-pg_r.patch
v2-0002-Use-copy_file_range-for-file-copying-in-pg_rewind.patch
Description: v2-0002-Use-copy_file_range-for-file-copying-in-pg_rewind.patch
Re: Freeze the inserted tuples during CTAS?
On Jan 27, 2021, at 5:33 PM, Darafei Komяpa Praliaskouski <m...@komzpa.net> wrote:

Hi, I confirm that my analytic workflows often do the CTAS and VACUUM of the relation right after, before the index creation, to mark stuff as all-visible for IOS to work. Freezing and marking as visible will help.

Thanks for letting me know there is such a real case in production environments. I attached the short patch. If there are no other concerns, I will register the patch in the commitfest.

-Paul

On Wed, Jan 27, 2021 at 12:29 PM Paul Guo <gu...@vmware.com> wrote:

Here is the simple patch,

diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c
index dce882012e..0391699423 100644
--- a/src/backend/commands/createas.c
+++ b/src/backend/commands/createas.c
@@ -552,7 +552,7 @@ intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
 	myState->rel = intoRelationDesc;
 	myState->reladdr = intoRelationAddr;
 	myState->output_cid = GetCurrentCommandId(true);
-	myState->ti_options = TABLE_INSERT_SKIP_FSM;
+	myState->ti_options = TABLE_INSERT_SKIP_FSM | TABLE_INSERT_FROZEN;

MatView code already does this, and COPY does it if specified. I'm not sure how the community thinks about this. Personally I care most about the all-visible setting that comes with TABLE_INSERT_FROZEN, since it lets an index-only scan work right after creating an index on a CTAS table; otherwise people have to run VACUUM first before index-only scans are used. If people do not want freezing by default, could we at least introduce an option to specify the visibility during insert?

Regards,
Paul

v1-0001-Freeze-the-tuples-during-CTAS.patch
Description: v1-0001-Freeze-the-tuples-during-CTAS.patch
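To make the index-only-scan motivation concrete, here is a minimal illustration (the table and data are made up for this example, not taken from the patch). With TABLE_INSERT_FROZEN set during CTAS, the visibility map is populated at insert time, so a check like the following should show no heap fetches without an intervening VACUUM:

create table t as select i as a from generate_series(1, 100000) i;
create index on t (a);
set enable_seqscan = off;
-- On unpatched HEAD this tends to report "Heap Fetches: N" because the
-- pages are not yet marked all-visible; with frozen inserts the visibility
-- map is already set and heap fetches drop to zero.
explain (analyze, costs off) select a from t where a > 99990;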
Re: Freeze the inserted tuples during CTAS?
Attached is the v2 version, which fixes a test failure due to a plan change (bitmap index scan -> index only scan).

v2-0001-Freeze-the-tuples-during-CTAS.patch
Description: v2-0001-Freeze-the-tuples-during-CTAS.patch
Re: Freeze the inserted tuples during CTAS?
> On Mar 3, 2021, at 1:35 PM, Masahiko Sawada wrote:
>> On Sun, Feb 21, 2021 at 4:46 PM Paul Guo wrote:
>> Attached is the v2 version that fixes a test failure due to plan change
>> (bitmap index scan -> index only scan).
> I think this is a good idea.
> BTW, how much does this patch affect the CTAS performance? I expect
> it's negligible but If there is much performance degradation due to
> populating visibility map, it might be better to provide a way to
> disable it.

Yes, this is a good suggestion. I did a quick test yesterday. Configuration: shared_buffers = 1280MB, and the test system has 7GB of memory. Test queries:

checkpoint;
\timing
create table t1 (a, b, c, d) as select i,i,i,i from generate_series(1, 20000000) i;
\timing
select pg_size_pretty(pg_relation_size('t1'));

Here are the running times:

HEAD:  Time: 10299.268 ms (00:10.299) + 1537.876 ms (00:01.538)
Patch: Time: 12257.044 ms (00:12.257) + 14.247 ms

The table size is 800+MB, so the table should fit entirely in the buffer. I was surprised to see the patch increase the CTAS time by 19.x%, and it is also not better than "CTAS + VACUUM" on HEAD. In theory the visibility map buffer changes should not cost that much. I looked at the related code again (heap_insert()). I believe the overhead could decrease along with some of the discussed CTAS optimization ideas (multi-insert, raw insert, etc).

I tested COPY also. COPY FREEZE does not involve much more overhead than plain COPY, according to the experiment results below. COPY uses multi-insert; other than that, there seems to be no relevant difference from CTAS when writing a new table.

COPY + VACUUM         Time: 8826.995 ms (00:08.827) + 1599.260 ms (00:01.599)
COPY FREEZE + VACUUM  Time: 8836.107 ms (00:08.836) + 13.581 ms

So maybe we should revisit freezing in CTAS after the CTAS insert path itself is optimized? By the way, REFRESH MATERIALIZED VIEW does freeze by default, and a matview is quite similar to CTAS. I tested it also, and the conclusion is similar to that of CTAS. Not sure why FREEZE was enabled there though; maybe I missed something?
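For anyone wanting to verify the visibility-map effect directly rather than through timings, the pg_visibility extension (in contrib) can count the all-visible/all-frozen pages right after the CTAS. A rough check, assuming the t1 table from the test above:

create extension if not exists pg_visibility;
-- With TABLE_INSERT_FROZEN both counts should roughly equal the relation's
-- page count immediately after CTAS; on HEAD they stay 0 until VACUUM runs.
select count(*) filter (where all_visible) as all_visible_pages,
       count(*) filter (where all_frozen)  as all_frozen_pages
from pg_visibility_map('t1');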
Re: fdatasync performance problem with large number of DB files
> On 2021/3/15, 7:34 AM, "Thomas Munro" wrote:
>> On Mon, Mar 15, 2021 at 11:52 AM Thomas Munro wrote:
>> Time being of the essence, here is the patch I posted last year, this
>> time with a GUC and some docs. You can set sync_after_crash to
>> "fsync" (default) or "syncfs" if you have it.
> Cfbot told me to add HAVE_SYNCFS to Solution.pm, and I fixed a couple of
> typos.

By the way, there is a common case where we could skip the fsync: a standby generated by pg_rewind/pg_basebackup that is already fsync-ed. The state of those standbys is surely not DB_SHUTDOWNED/DB_SHUTDOWNED_IN_RECOVERY, so the pgdata directory is fsync-ed again during startup when those pg instances are started. We could ask users not to fsync during pg_rewind and pg_basebackup, but we probably want pg_rewind to fsync just some files (see [1]), so better to let the startup process skip the unnecessary fsync?

As to the solution, using a GUC or writing something into a file like backup_label(?) does not seem like a good idea, since:
1. With a GUC, we still expect fsync after a real crash recovery, so we would need to reset the GUC and also specify PGOPTIONS in the pg_ctl command.
2. If pg_rewind/pg_basebackup write some hint information into a file like backup_label(?), people might copy the pgdata directory, and then we still need the fsync.

The only simple solution I can think of is to let the user touch a file, before starting the pg instance, to hint the startup process.

[1] https://www.postgresql.org/message-id/flat/25CFBDF2-5551-4CC3-ADEB-434B6B1BAD16%40vmware.com#734e7dc77f0760a3a64e808476ecc592
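A minimal sketch of the touch-a-file idea, placed where the startup process currently calls SyncDataDirectory(). The signal file name and the wrapper function are hypothetical, made up for this sketch; only to illustrate the shape of the change, not an actual patch:

/* Hypothetical signal file; the name is invented for this sketch. */
#define DATA_SYNCED_SIGNAL_FILE "pg_data_is_synced.signal"

static void
MaybeSyncDataDirectory(void)
{
	struct stat st;

	if (stat(DATA_SYNCED_SIGNAL_FILE, &st) == 0)
	{
		/*
		 * The user asserted that pgdata is already durable.  Consume the
		 * hint so it cannot accidentally apply to a later, real crash.
		 */
		durable_unlink(DATA_SYNCED_SIGNAL_FILE, WARNING);
		ereport(LOG,
				(errmsg("skipping data directory sync, signal file found")));
		return;
	}

	SyncDataDirectory();
}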
Re: Freeze the inserted tuples during CTAS?
> to set visibility map bits on materialized views. I'll start a new
> thread to discuss that.

Thanks. I have also withdrawn the patch.
Re: fdatasync performance problem with large number of DB files
On Tue, Mar 16, 2021 at 4:29 PM Fujii Masao wrote:
>
> On 2021/03/16 8:15, Thomas Munro wrote:
> > On Tue, Mar 16, 2021 at 3:30 AM Paul Guo wrote:
> >> By the way, there is a common case where we could skip the fsync: a standby
> >> generated by pg_rewind/pg_basebackup that is already fsync-ed.
> >> The state of those standbys is surely not
> >> DB_SHUTDOWNED/DB_SHUTDOWNED_IN_RECOVERY, so the
> >> pgdata directory is fsync-ed again during startup when those pg instances
> >> are started. We could ask users not to fsync
> >> during pg_rewind and pg_basebackup, but we probably want pg_rewind to fsync
> >> just some files (see [1]), so better
> >> to let the startup process skip the unnecessary fsync? As to the solution,
> >> using a GUC or writing something into a file like
> >> backup_label(?) does not seem like a good idea, since
> >> 1. With a GUC, we still expect fsync after a real crash recovery, so we need to
> >> reset the GUC and also specify PGOPTIONS in the pg_ctl command.
> >> 2. If pg_rewind/pg_basebackup write some hint information into a file like
> >> backup_label(?), people might
> >> copy the pgdata directory and then we still need the fsync.
> >> The only simple solution I can think of is to let the user touch a file
> >> to hint startup, before starting the pg instance.
> >
> > As a thought experiment only, I wonder if there is a way to make your
> > touch-a-special-signal-file scheme more reliable and less dangerous
> > (considering people might copy the signal file around or otherwise
> > screw this up). It seems to me that invalidation is the key, and
> > "unlink the signal file after the first crash recovery" isn't good
> > enough. Hmm. What if the file contained a fingerprint containing...
> > let's see... checkpoint LSN, hostname, MAC address, pgdata path, ...

The hostname, MAC address, or pgdata path (or, e.g., the inode of a file?) might be the same after VM cloning or directory copying, though that is not usual. I cannot come up with a stable scheme that invalidates the information after VM/directory cloning or copying, so the simplest way seems to be to leave the decision (i.e. touching a file) to users, instead of having pg_rewind/pg_basebackup write the information automatically.

> > (add more seasoning to taste), and then also some flags to say what is
> > known to be fully fsync'd already: the WAL, pgdata but only as far as
> > changes up to the checkpoint LSN, or all of pgdata? Then you could be
> > conservative for a non-match, but skip the extra work in some common
> > cases like pg_basebackup, as long as you trust the fingerprint scheme
> > not to produce false positives. Or something like that...
> >
> > I'm not too keen to invent clever new schemes for PG14, though. This
> > sync_after_crash=syncfs scheme is pretty simple, and has the advantage
> > that it's very cheap to do it extra redundant times assuming nothing
> > else is creating new dirty kernel pages in serious quantities. Is
> > that useful enough? In particular it avoids the dreaded "open
> > 1,000,000 uncached files over high latency network storage" problem.
> >
> > I don't want to add a hypothetical sync_after_crash=none, because it
> > seems like generally a bad idea. We already have a
> > running-with-scissors mode you could use for that: fsync=off.
>
> I heard that some backup tools sync the database directory when restoring it.
> I guess that those who use such tools might want the option to disable such
> startup sync (i.e., sync_after_crash=none) because it's not necessary.
This scenario seems to support the file-touching solution, since we do not have an automatic way to skip the fsync. I thought about using sync_after_crash=none to fix my issue, but as I said, we would need to reset the GUC since we still expect fsync/syncfs after a second crash.

> They can skip that sync by fsync=off. But if they just want to skip only that
> startup sync and make subsequent recovery (or standby server) work with
> fsync=on, they would need to shutdown the server after that startup sync
> finishes, enable fsync, and restart the server. In this case, since the server
> is restarted with the state=DB_SHUTDOWNED_IN_RECOVERY, the startup sync
> would not be performed. This procedure is tricky. So IMO supporting

This makes the process complex. From the perspective of product design, it does not seem attractive.

> sync_after_crash=none would be helpful for this case and simple.

Regards,
Paul Guo (VMware)
Re: Multi Inserts in CREATE TABLE AS - revived patch
> On Nov 9, 2020, at 6:41 PM, Bharath Rupireddy wrote:
>
> On Tue, Nov 3, 2020 at 4:54 PM Bharath Rupireddy wrote:
>>
>> If the approach followed in the patch looks okay, I can work on a separate
>> patch for multi inserts in refresh materialized view cases.
>
> Hi, I'm attaching a v2 patch that has multi inserts for CTAS as well
> as REFRESH MATERIALiZED VIEW.
>
> I did some testing: exec time in seconds.
>
> Use case 1: 1 int and 1 text column. each row size 129 bytes, size of
> 1 text column 101 bytes, number of rows 100million, size of heap file
> 12.9GB.
> HEAD: 220.733, 220.428
> Patch: 151.923, 152.484
>
> Thoughts?
>
> With Regards,
> Bharath Rupireddy.
> EnterpriseDB: http://www.enterprisedb.com/

Thanks for doing this. There might be another solution: use the raw insert interfaces (i.e. raw_heap_insert()). Attached is a test (not formal) patch that verifies this idea. raw_heap_insert() writes pages into the table file directly and also writes the FPI WAL record when tuples fill up a whole page. This seems to be more efficient.

In addition, those raw write interfaces call smgrimmedsync() when they finish raw inserting. This is because the writes bypass shared buffers, so a CHECKPOINT plus a crash could cause data corruption: some FPI WAL records cannot be replayed, and those table files were not fsync-ed before the crash. It seems that a sync request could instead be forwarded to the checkpointer for each table segment file, and then we would not need to call smgrimmedsync(). If the theory is correct, this should be a separate patch. Anyway, I tested this idea too by simply commenting out the smgrimmedsync() call in heap_raw_insert_end() (a new function in the attached patch), since forwarding an fsync request is light-weight.

I did quick and simple testing. The test environment is a CentOS 6 VM with 7GB of memory on my Mac laptop; -O3 gcc compiler option; shared_buffers set to 2GB. I did not check whether parallel scanning is triggered by the test query, and the data volume is not large, so the test time is not long. Here is the test script:

create table t1 (a int, b int, c int, d int);
insert into t1 select i,i,i,i from generate_series(1,1000) i;
show shared_buffers;
\timing on
create table t2 as select * from t1;
\timing off

Here are the results:

HEAD (37d2ff380312):            Time: 5143.041 ms (00:05.143)
Multi insert patch:             Time: 4456.461 ms (00:04.456)
Raw insert (attached):          Time: 2317.453 ms (00:02.317)
Raw insert + no smgrimmedsync(): Time: 2103.610 ms (00:02.104)

From the above data, raw insert is better; forwarding the sync should improve things further. (Note my laptop has an SSD, so on machines with SATA/SAS I believe forwarding the sync would help even more.) FYI, I also tested removing smgrimmedsync in the "vacuum full" code, which uses raw insert as well:

HEAD:             Time: 3567.036 ms (00:03.567)
no smgrimmedsync: Time: 3023.487 ms (00:03.023)

Raw insert could be used for CTAS and CREATE MATERIALIZED VIEW. For REFRESH MATERIALIZED VIEW the code is a bit different; I did not spend more time on this, so I'm not sure raw insert could be used there. But I think the previous multi-insert work could still be used in at least "INSERT tbl SELECT…" (if the INSERT is a simple one, e.g.
no trigger, no index, etc).

Regards,
Paul

v1-0001-ctas-using-raw-insert.patch
Description: v1-0001-ctas-using-raw-insert.patch
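For readers unfamiliar with the raw interfaces, here is a compact sketch of the page-flush step, modeled on the existing raw_heap_insert()/end_heap_rewrite() code in rewriteheap.c (which VACUUM FULL/CLUSTER use), rather than on the attached patch: WAL-log the filled page as a full-page image, then write it past shared buffers with smgrextend().

/*
 * Sketch of flushing one filled page in the raw-insert path; modeled on
 * rewriteheap.c, not copied from the attached patch.
 */
static void
raw_flush_page(RewriteState state)
{
	/* WAL-log the full page image first, while the page is still private */
	if (RelationNeedsWAL(state->rs_new_rel))
		log_newpage(&state->rs_new_rel->rd_node, MAIN_FORKNUM,
					state->rs_blockno, state->rs_buffer, true);

	PageSetChecksumInplace(state->rs_buffer, state->rs_blockno);

	/*
	 * Write directly via smgr, bypassing shared buffers.  skipFsync=true is
	 * only safe because the caller either does smgrimmedsync() at the end,
	 * or (per the idea above) forwards a sync request to the checkpointer.
	 */
	smgrextend(state->rs_new_rel->rd_smgr, MAIN_FORKNUM,
			   state->rs_blockno++, (char *) state->rs_buffer, true);
}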
Re: Multi Inserts in CREATE TABLE AS - revived patch
> On Nov 13, 2020, at 7:21 PM, Bharath Rupireddy wrote:
>
> On Tue, Nov 10, 2020 at 3:47 PM Paul Guo wrote:
>>
>> Thanks for doing this. There might be another solution: use the raw insert
>> interfaces (i.e. raw_heap_insert()).
>> Attached is a test (not formal) patch that verifies this idea.
>> raw_heap_insert() writes pages into the
>> table file directly and also writes the FPI WAL record when tuples fill up
>> a whole page. This seems to be
>> more efficient.
>>
>
> Thanks. Will the new raw_heap_insert() APIs scale well (i.e. extend
> the table parallelly) with parallelism? The existing
> table_multi_insert() API scales well, see, for instance, the benefit
> with parallel copy[1] and parallel multi inserts in CTAS[2].

Yes, definitely some work needs to be done to make the raw heap insert interfaces fit the parallel work, but it seems there is no hard blocking issue here?

> [1] - https://www.postgresql.org/message-id/CALj2ACWeQVd-xoQZHGT01_33St4xPoZQibWz46o7jW1PE3XOqQ%40mail.gmail.com
> [2] - https://www.postgresql.org/message-id/CALj2ACWFq6Z4_jd9RPByURB8-Y8wccQWzLf%2B0-Jg%2BKYT7ZO-Ug%40mail.gmail.com
>
> With Regards,
> Bharath Rupireddy.
> EnterpriseDB: http://www.enterprisedb.com/
Re: standby recovery fails (tablespace related) (tentative patch and discussion)
Thanks for the review; please see the replies below.

> On Jan 5, 2021, at 9:07 AM, Kyotaro Horiguchi wrote:
>
> At Wed, 8 Jul 2020 12:56:44 +0000, Paul Guo wrote in
>> On 2020/01/15 19:18, Paul Guo wrote:
>> I further fixed the last test failure (due to a small bug in the test, not
>> in code). Attached are the new patch series. Let's see the CI pipeline
>> result.
>>
>> Thanks for updating the patches!
>>
>> I started reading the 0003 patch.
>>
>> The approach that the 0003 patch uses is not the perfect solution.
>> If the standby crashes after tblspc_redo() removes the directory and before
>> its subsequent COMMIT record is replayed, PANIC error would occur since
>> there can be some unresolved missing directory entries when we reach the
>> consistent state. The problem would very rarely happen, though...
>> Just idea; calling XLogFlush() to update the minimum recovery point just
>> before tblspc_redo() performs destroy_tablespace_directories() may be
>> safe and helpful for the problem?
>
> It seems to me that what the current patch does is too complex. What
> we need to do here is to remember every invalid operation then forget
> it when the prerequisite object is dropped.
>
> When a table space is dropped before consistency is established, we
> don't need to care what has been performed inside the tablespace. In
> this perspective, it is enough to remember tablespace ids when failed
> to do something inside it due to the absence of the tablespace and
> then forget it when we remove it. We could remember individual
> database id to show them in error messages, but I'm not sure it's
> useful. The reason log_invalid_page records block numbers is to allow
> the machinery handle partial table truncations, but this is not the
> case since dropping tablespace cannot leave some of containing
> databases.
>
> As the result, we won't see an unresolved invalid operations in a
> dropped tablespace.
>
> Am I missing something?

Yes, removing the database id from the hash key in the log/forget code should usually be fine, but the previous code does stricter/safer checking. Consider this scenario:

CREATE DATABASE newdb1 TEMPLATE template_db1;
CREATE DATABASE newdb2 TEMPLATE template_db2; <- in case the template_db2 database directory is somehow missing abnormally
DROP DATABASE template_db1;

The previous code could detect this, but if we remove the database id from the code, this bad scenario is missed.

> dbase_redo:
> + if (!(stat(parent_path, &st) == 0 && S_ISDIR(st.st_mode)))
> + {
> +	XLogRecordMissingDir(xlrec->tablespace_id, InvalidOid, parent_path);
>
> This means "record the belonging table space directory if it is not
> found OR it is not a directory". The former can be valid but the
> latter is unconditionally can not (I don't think we bother considering
> symlinks there).

Again, this is the safer check: in case parent_path is, for example, somehow a file, we should eventually PANIC for that case and let the user investigate and then redo the recovery.

> +/*
> + * Source directory may be missing. E.g. the template database used
> + * for creating this database may have been dropped, due to reasons
> + * noted above. Moving a database from one tablespace may also be a
> + * partner in the crime.
> + */
> +if (!(stat(src_path, &st) == 0 && S_ISDIR(st.st_mode)))
> +{
> +	XLogLogMissingDir(xlrec->src_tablespace_id, xlrec->src_db_id, src_path);
>
> This is a part of *creation* of the target directory.
> Lack of the
> source directory cannot be valid even if the source directory is
> dropped afterwards in the WAL stream and we can allow that if the
> *target* tablespace is dropped afterwards. As the result, as I
> mentioned above, we don't need to record about the database directory.
>
> By the way the name XLogLogMiss.. is somewhat confusing. How about
> XLogReportMissingDir (named after report_invalid_page).

Agreed. Your comment also reminds me that we should skip the check once the consistency point is reached. Here is a git diff against the previous patch; I'll send out the new rebased patches once consensus is reached.

diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 7ade385965..c8fe3fe228 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -90,7 +90,7 @@ typedef struct xl_missing_dir
 static HTAB *missing_dir_tab = NULL;
 
 void
-XLogLogMissingDir(Oid spcNode, Oid dbNode, char *path)
+XLogReportMissingDir(Oid spcNode, Oid dbNode, char *path
Two patches to speed up pg_rewind.
While reading the pg_rewind code I found two things that could speed up pg_rewind. Attached are the patches.

First one: pg_rewind fsyncs the whole pgdata directory on the target by default, but that is a waste since usually only part of the files/directories on the target is modified. The other files on the target should have been flushed already, since pg_rewind requires a clean shutdown before doing the real work. This helps the scenario where the target postgres instance includes millions of files, which has been seen in a real environment. There are several things that may need further discussion:

1. PG_FLUSH_DATA_WORKS was introduced as "Define PG_FLUSH_DATA_WORKS if we have an implementation for pg_flush_data", but now the code guarded by it is only relevant to pre_sync_fname(), so we might want to rename it to something like HAVE_PRE_SYNC?

2. The pre_sync_fname() implementation looks like this:

#if defined(HAVE_SYNC_FILE_RANGE)
	(void) sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE);
#elif defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED)
	(void) posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);

I'm a bit suspicious about calling posix_fadvise() with POSIX_FADV_DONTNEED. I did not check the Linux kernel code, but according to the man page I suspect this option makes the kernel tend to evict the related pages from the page cache, which might not be what we expect. This is not a big issue, since sync_file_range() exists on most widely used Linux systems.

Also, I'm not sure how much we can benefit from the pre_sync code. Note that if the directory has a lot of files, or the IO is fast, pre_sync_fname() might slow the process down instead. The reasons: if there are a lot of files, it is possible that we need to read already-synced-and-evicted inodes from disk (by open()-ing) after rewinding, since the inode cache in the Linux kernel is limited; and if the IO is fast and the kernel flushes dirty pages in the background quickly, pre_sync_fname() might just waste CPU cycles. A better solution might be to launch a separate thread that fsyncs files one by one as pg_rewind finishes handling each file; pg_basebackup could use that solution as well. Anyway, this is independent of this patch.

The second one is to use copy_file_range() for the local rewind case to replace read()+write(). This introduces a copy_file_range() configure check and HAVE_COPY_FILE_RANGE so other code could use copy_file_range() if needed. copy_file_range() was introduced in newer Linux kernels; on older Linux kernels or other Unix-like OSes mmap() might beat read()+write(), but copy_file_range() is more interesting given that it can skip the actual data copy on some file systems; this could benefit Linux filesystems on network-based block storage even more.

Regards,
Paul

0001-Fsync-the-affected-files-directories-only-in-pg_rewi.patch
Description: 0001-Fsync-the-affected-files-directories-only-in-pg_rewi.patch
0002-Use-copy_file_range-for-file-copying-in-pg_rewind.patch
Description: 0002-Use-copy_file_range-for-file-copying-in-pg_rewind.patch
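For reference, a minimal sketch of what the local-copy path could look like with a fallback, assuming the HAVE_COPY_FILE_RANGE configure check described above; the function name and buffer size are made up here, not taken from the patch:

static void
local_copy_range(int src_fd, int dest_fd, size_t nbytes)
{
#ifdef HAVE_COPY_FILE_RANGE
	while (nbytes > 0)
	{
		/* In-kernel copy; may be a reflink-style zero-copy on some filesystems */
		ssize_t		copied = copy_file_range(src_fd, NULL, dest_fd, NULL,
											 nbytes, 0);

		if (copied < 0)
			pg_fatal("could not copy file range: %m");
		if (copied == 0)
			break;				/* unexpected EOF */
		nbytes -= copied;
	}
#else
	char		buf[32768];

	while (nbytes > 0)
	{
		ssize_t		nread = read(src_fd, buf, Min(sizeof(buf), nbytes));

		if (nread <= 0)
			pg_fatal("could not read source file: %m");
		if (write(dest_fd, buf, nread) != nread)
			pg_fatal("could not write target file: %m");
		nbytes -= nread;
	}
#endif
}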
Freeze the inserted tuples during CTAS?
Here is the simple patch,

diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c
index dce882012e..0391699423 100644
--- a/src/backend/commands/createas.c
+++ b/src/backend/commands/createas.c
@@ -552,7 +552,7 @@ intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
 	myState->rel = intoRelationDesc;
 	myState->reladdr = intoRelationAddr;
 	myState->output_cid = GetCurrentCommandId(true);
-	myState->ti_options = TABLE_INSERT_SKIP_FSM;
+	myState->ti_options = TABLE_INSERT_SKIP_FSM | TABLE_INSERT_FROZEN;

MatView code already does this, and COPY does it if specified. I'm not sure how the community thinks about this. Personally I care most about the all-visible setting that comes with TABLE_INSERT_FROZEN, since it lets an index-only scan work right after creating an index on a CTAS table; otherwise people have to run VACUUM first before index-only scans are used. If people do not want freezing by default, could we at least introduce an option to specify the visibility during insert?

Regards,
Paul
Re: Two patches to speed up pg_rewind.
On Jan 28, 2021, at 3:31 PM, Michael Paquier <mich...@paquier.xyz> wrote:

On Wed, Jan 27, 2021 at 09:18:48AM +0000, Paul Guo wrote:
Second one is use copy_file_range() for the local rewind case to replace read()+write(). This introduces copy_file_range() check and HAVE_COPY_FILE_RANGE so other code could use copy_file_range() if needed. copy_file_range() was introduced in newer Linux kernels; on older Linux or other Unix-like OSes mmap() might be better than read()+write(), but copy_file_range() is more interesting given that it can skip the data copy on some file systems - this could benefit Linux filesystems on network-based block storage even more.

Have you done some measurements?

I did not test pg_rewind itself, but for patch 2 I tested copy_file_range() vs read()+write() on XFS on Ubuntu 20.04.1 while working on the patches. Here are the times for copying a 1GB file (fully populated with random data); the test is a simple C program.

copy_file_range() loop (actually it finished after one call) + fsync(): 0m0.048s
read()+write() loop with a 32K read/write buffer + fsync():             0m5.004s

For patch 1: it syncs fewer files, so it surely benefits performance.
Re: fdatasync performance problem with large number of DB files
On Wed, Mar 17, 2021 at 11:45 AM Thomas Munro wrote:
>
> On Tue, Mar 16, 2021 at 9:29 PM Fujii Masao wrote:
> > On 2021/03/16 8:15, Thomas Munro wrote:
> > > I don't want to add a hypothetical sync_after_crash=none, because it
> > > seems like generally a bad idea. We already have a
> > > running-with-scissors mode you could use for that: fsync=off.
> >
> > I heard that some backup tools sync the database directory when restoring it.
> > I guess that those who use such tools might want the option to disable such
> > startup sync (i.e., sync_after_crash=none) because it's not necessary.
>
> Hopefully syncfs() will return quickly in that case, without doing any work?

I just quickly reviewed the patch (the code part). It looks good. My only concern or question is do_syncfs() calling syncfs() on an fd opened from a symlink path; I'm not 100% sure about that.

I think we could consider reviewing and then pushing the syncfs patch at this point: the fsync issue really hurts, and even though there seems to be a better plan for the future, this patch may make the sync step in startup much faster. Today I encountered my first real case of this in a production environment, where startup spends more than an hour on the fsync step. I'm pretty sure the pgdata is already synced; per a pstack of the startup process, I saw it slowly open(), fsync() (surely doing nothing) and close() files one by one, and pre_sync_fname() is also a burden in such a scenario. So this issue is really annoying.

Even with this patch in, we could discuss further optimizing the special crash recovery scenario where users explicitly know the sync step can be skipped (this scenario is surely not unusual). The syncfs patch could be used for that scenario too, but the filesystem might be shared by other applications (this is not unusual, and happens in my customer's environment for example), so syncfs() could be (probably much) slower than skipping the sync step entirely.

--
Paul Guo (VMware)
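On the symlink question: here is a rough sketch of what a per-filesystem sync helper does (an approximation of the patch's do_syncfs(), not a copy of it). Note that open() follows symlinks, so an fd opened via a pg_tblspc link refers to the link target, and syncfs() then syncs the target's filesystem, which appears to be the desired behavior:

/* Approximate sketch of a per-filesystem sync helper; not the patch's code. */
static void
sync_filesystem_of(const char *path)
{
	/* open() resolves symlinks, so this fd is on the link target's fs */
	int			fd = OpenTransientFile(path, O_RDONLY);

	if (fd < 0)
	{
		ereport(LOG,
				(errcode_for_file_access(),
				 errmsg("could not open file \"%s\": %m", path)));
		return;
	}
	if (syncfs(fd) < 0)
		ereport(LOG,
				(errcode_for_file_access(),
				 errmsg("could not synchronize file system for file \"%s\": %m",
						path)));
	CloseTransientFile(fd);
}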
Re: fdatasync performance problem with large number of DB files
About the syncfs patch: my first impression of the GUC name sync_after_crash is that it sounds like a boolean. I'm not sure about other people's feelings. Do you think it would be better to rename it to a clearer name like sync_method_after_crash, or something similar?
Re: standby recovery fails (tablespace related) (tentative patch and discussion)
On 2021/3/27, 10:23 PM, "Alvaro Herrera" wrote:
> Hmm, can you post a rebased set, where the points under discussion
> are marked in XXX comments explaining what the issue is? This thread is
> long and old ago that it's pretty hard to navigate the whole thing in
> order to find out exactly what is being questioned.

OK. Attached is the rebased version, which includes the change I discussed in my previous reply. I also added the POD documentation change for RecursiveCopy, and modified the patch to use the backup_options introduced in 081876d75ea15c3bd2ee5ba64a794fd8ea46d794 for tablespace mapping.

> I think 0004 can be pushed without further ado, since it's a clear and
> simple fix. 0001 needs a comment about the new parameter in
> RecursiveCopy's POD documentation.

Yeah, 0004 is not risky at all. One concern might be compatibility with some WAL dump/analysis tools(?). I have no idea about this, but if we do not backport 0004 we do not seem to need to worry about it.

> As I understand, this is a backpatchable bug-fix.

Yes. Thanks.

v11-0001-Support-node-initialization-from-backup-with-tab.patch
Description: v11-0001-Support-node-initialization-from-backup-with-tab.patch
v11-0002-Tests-to-replay-create-database-operation-on-sta.patch
Description: v11-0002-Tests-to-replay-create-database-operation-on-sta.patch
v11-0003-Fix-replay-of-create-database-records-on-standby.patch
Description: v11-0003-Fix-replay-of-create-database-records-on-standby.patch
v11-0004-Fix-database-create-drop-wal-description.patch
Description: v11-0004-Fix-database-create-drop-wal-description.patch
Some two phase optimization ideas
Hello hackers,

While working on two-phase-commit related issues, I found several things in the two-phase code that could be optimized.

1. The current implementation decouples PREPARE and COMMIT/ABORT PREPARED a lot. This is flexible, but if PREPARE and COMMIT/ABORT mostly happen on the same backend, we could use a caching mechanism to speed things up. E.g.

a. FinishPreparedTransaction()->LockGXact(gid, user) does:

for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
	... find the gxact that matches gid ...

For this, we can cache the gxact during PREPARE and use it for a fast path: if the cached gxact matches the gid, we do not need to walk through the gxact array. (By the way, if the gxact array is large, that becomes a separate performance issue; use a shared-memory hash table if needed?)

b. FinishPreparedTransaction() reads the PREPARE information from either the state file (stored during checkpoint) or the WAL file. We could cache that content during PREPARE, i.e. in EndPrepare(), and then FinishPreparedTransaction() can avoid reading the state file or the WAL file.

It is possible that some databases based on Postgres two-phase commit might not want the cache, e.g. if the PREPARE backend is always different from the COMMIT/ABORT PREPARED backend (I do not know of a database designed like that, though), but the gxact cache has almost no overhead, and for (b) we could use an #ifdef to guard the PREPARE WAL data copying code. The two optimizations are easy and small; I've verified them on Greenplum (based on Postgres 12).

2. WAL content duplication between PREPARE and COMMIT/ABORT PREPARED.

See the COMMIT PREPARED function call below. Those hdr->* values already exist in the PREPARE WAL record, so we do not need them in the COMMIT PREPARED WAL record as well. During recovery, we could load this information (for both COMMIT and ABORT) into memory, and in COMMIT/ABORT PREPARED redo use the corresponding data.

RecordTransactionCommitPrepared(xid,
								hdr->nsubxacts, children,
								hdr->ncommitrels, commitrels,
								hdr->ninvalmsgs, invalmsgs,
								hdr->initfileinval, gid);

One drawback is that this might involve non-trivial changes.

3. About the gid: it is currently defined as a char[]. I'm wondering if we should define an opaque type and let databases implement their own gid types using callbacks. Typically, if I want to use a 64-bit distributed xid as the gid, the current code is not that performance- and storage-friendly (e.g. we still need strcmp() to find the gxact in LockGXact()). We could keep a char[] implementation as the default. The gid is not widely used, so the change seems small (interfaces for copy, comparison, conversion from string to the internal gid type for the PREPARE statement, etc).

Any thoughts?

Regards,
Paul
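A minimal sketch of idea 1(a), to make it concrete. last_prepared_gxact and lock_gxact_checked() are hypothetical (not from any patch); the other names follow twophase.c, and the TwoPhaseStateLock handling is omitted for brevity:

/* Hypothetical backend-local cache, set when this backend runs PREPARE. */
static GlobalTransaction last_prepared_gxact = NULL;

static GlobalTransaction
LockGXact(const char *gid, Oid user)
{
	/*
	 * Fast path: if this backend prepared the transaction itself, the
	 * cached gxact almost always matches, so we skip the array walk.
	 */
	if (last_prepared_gxact != NULL &&
		last_prepared_gxact->valid &&
		strcmp(last_prepared_gxact->gid, gid) == 0)
		return lock_gxact_checked(last_prepared_gxact, user);

	/* Slow path: linear scan, as the current code does */
	for (int i = 0; i < TwoPhaseState->numPrepXacts; i++)
	{
		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];

		if (gxact->valid && strcmp(gxact->gid, gid) == 0)
			return lock_gxact_checked(gxact, user);
	}

	ereport(ERROR,
			(errcode(ERRCODE_UNDEFINED_OBJECT),
			 errmsg("prepared transaction with identifier \"%s\" does not exist",
					gid)));
	return NULL;				/* keep compiler quiet */
}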
pg_ugprade test failure on data set with column with default value with type bit/varbit
Hello,

While testing pg_upgrade we seem to have found an issue related to the default value of a column with type bit/varbit. Below are the steps to reproduce. We added two 'create table' DDLs to the regression database and then saw a diff after pg_upgrade testing. The pg binaries are built from code pulled a couple of days ago.

[pguo@host67:/data2/postgres/src/bin/pg_upgrade]$ git diff
diff --git a/src/bin/pg_upgrade/test.sh b/src/bin/pg_upgrade/test.sh
old mode 100644
new mode 100755
index 39983abea1..a41105859e
--- a/src/bin/pg_upgrade/test.sh
+++ b/src/bin/pg_upgrade/test.sh
@@ -188,6 +188,9 @@ if "$MAKE" -C "$oldsrc" installcheck; then
 		psql -X -d regression -c "$fix_sql;" || psql_fix_sql_status=$?
 	fi
 
+	psql -X -d regression -c "CREATE TABLE t111 ( a40 bit varying(5) DEFAULT '1');"
+	psql -X -d regression -c "CREATE TABLE t222 ( a40 bit varying(5) DEFAULT B'1');"
+
 	pg_dumpall --no-sync -f "$temp_root"/dump1.sql || pg_dumpall1_status=$?
 	if [ "$newsrc" != "$oldsrc" ]; then

[pguo@host67:/data2/postgres/src/bin/pg_upgrade]$ make check
[pguo@host67:/data2/postgres/src/bin/pg_upgrade]$ diff -du tmp_check/dump1.sql tmp_check/dump2.sql
--- tmp_check/dump1.sql	2018-03-30 16:31:44.329021348 +0800
+++ tmp_check/dump2.sql	2018-03-30 16:31:54.100478202 +0800
@@ -10956,7 +10956,7 @@
 --
 
 CREATE TABLE public.t111 (
-    a40 bit varying(5) DEFAULT B'1'::bit varying
+    a40 bit varying(5) DEFAULT (B'1'::"bit")::bit varying
 );

There is no difference in the functionality of the dumped SQL, but it is annoying. The simple patch below could fix this. Thanks.

[pguo@host67:/data2/postgres/src/bin/pg_upgrade]$ git diff
diff --git a/src/backend/utils/adt/ruleutils.c b/src/backend/utils/adt/ruleutils.c
index 2cd54ec33f..ef2ab2d681 100644
--- a/src/backend/utils/adt/ruleutils.c
+++ b/src/backend/utils/adt/ruleutils.c
@@ -9389,7 +9389,7 @@ get_const_expr(Const *constval, deparse_context *context, int showtype)
 
 		case BITOID:
 		case VARBITOID:
-			appendStringInfo(buf, "B'%s'", extval);
+			appendStringInfo(buf, "'%s'", extval);
 			break;
 
 		case BOOLOID:
Re: standby recovery fails (tablespace related) (tentative patch and discussion)
On Tue, Sep 3, 2019 at 11:58 PM Alvaro Herrera wrote:
> On 2019-Aug-22, Anastasia Lubennikova wrote:
>
> > 22.08.2019 16:13, Paul Guo wrote:
> > > Thanks. I updated the patch to v5. It passes install-check testing and
> > > recovery testing.
> > Hi,
> > Thank you for working on this fix.
> > The overall design of the latest version looks good to me.
> > But during the review, I found a bug in the current implementation.
> > New behavior must apply to crash-recovery only, now it applies to
> > archiveRecovery too.
>
> Hello
>
> Paul, Kyotaro, are you working on updating this bugfix? FWIW the latest
> patch submitted by Paul is still current and CFbot says it passes its
> own test, but from Anastasia's email it still needs a bit of work.
>
> Also: it would be good to have this new bogus scenario described by
> Anastasia covered by a new TAP test. Anastasia, can we enlist you to
> write that? Maybe Kyotaro?

Thanks Anastasia and Alvaro for the comments and suggestions. Sorry, I've been busy working on some non-PG stuff recently. I've never worked on archive recovery, so I'll need a bit more time once I'm free (hopefully in several days) to take a look. Of course, Kyotaro, Anastasia, or anyone else should feel free to address the concern before then.
Re: Two pg_rewind patches (auto generate recovery conf and ensure clean shutdown)
> > It seems there's minor breakage in the build, per CFbot. Can you > please rebase this? > > There is a code conflict. See attached for the new version. Thanks. v5-0001-Extact-common-functions-from-pg_basebackup-into-s.patch Description: Binary data v5-0002-Add-option-to-write-recovery-configuration-inform.patch Description: Binary data v5-0003-Ensure-target-clean-shutdown-at-the-beginning-of-.patch Description: Binary data
Re: Batch insert in CTAS/MatView code
On Fri, Aug 2, 2019 at 2:55 AM Heikki Linnakangas wrote:
> On 17/06/2019 15:53, Paul Guo wrote:
> > I noticed that to do batch insert, we might need an additional memory copy
> > sometimes compared with "single insert"
> > (that should be the reason we previously saw slight regressions), so a
> > good solution seems to be to fall back
> > to "single insert" if the tuple length is larger than a threshold. I set
> > this to 2000 after quick testing.
>
> Where does the additional memory copy come from? Can we avoid doing it
> in the multi-insert case?

Hi Heikki, sorry for the late reply. I took some time looking at and debugging the TupleTableSlotOps code for the various TupleTableSlot types carefully, especially the BufferHeapTupleTableSlot case, on which we seemed to see a regression when no threshold was set, and I also debugged and tested the CTAS case more. I found that my earlier claim of an "additional memory copy" (mainly a tuple content copy compared with single insert) is wrong based on the latest code (and it was probably wrong with the previous code too). So in theory we should not worry about additional tuple copy overhead now, and I tried the patch without a multi-insert threshold, as attached.

To make the test results more stable, this time I ran a simple 'select count(*) from tbl' before each CTAS to warm up the shared buffers, ran a checkpoint before each CTAS, disabled autovacuum ('autovacuum = off'), and set larger shared buffers (but < 25% of total memory, as recommended by the PG docs) so that the CTAS always hits shared buffers when warm buffers exist (double-checked via explain(analyze, buffers)). These seem to be reasonable conditions for performance testing. Each kind of CTAS test was run three times (note: before each run we warm up and checkpoint as mentioned). I mainly tested the t12 case (a normal table with tuple size ~2k), since for the others our patch performs either better or similarly.

Patch:                1st_run   2nd_run   3rd_run
t12_BufferHeapTuple   7883.400  7549.966  8090.080
t12_Virtual           8041.637  8191.317  8182.404

Baseline:             1st_run   2nd_run   3rd_run
t12_BufferHeapTuple   8264.290  7508.410  7681.702
t12_Virtual           8167.792  7970.537  8106.874

I also roughly tested the other tables we mentioned (t11 and t14); the test results and conclusions are the same. t12_BufferHeapTuple means: create table tt as select * from t12; t12_Virtual means: create table tt as select *partial columns* from t12.

So it looks like for t12 the results of our code and the baseline are similar, so not setting a threshold seems fine, though the t12_BufferHeapTuple results vary a lot (by up to 0.5 seconds) for both our patch and the baseline, versus the virtual case, which is quite stable. This actually confused me a bit, given we've cached the source table in shared buffers. I suspected checkpoint effects, so I disabled checkpoints during the CTAS by setting max_wal_size = 3000; the BufferHeapTuple case (see below) still varies some. I'm not sure what the reason is, but it does not seem to be a blocker for the patch.

Patch:                1st_run   2nd_run   3rd_run
t12_BufferHeapTuple   7717.304  7413.259  7452.773
t12_Virtual           7445.742  7483.148  7593.583

Baseline:             1st_run   2nd_run   3rd_run
t12_BufferHeapTuple   8186.302  7736.541  7759.056
t12_Virtual           8004.880  8096.712  7961.483

v4-0001-Multi-insert-in-Create-Table-As.patch
Description: Binary data
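The warm-up/measure cycle described above, written out as a psql script for clarity (the table name and exact settings here are illustrative):

-- repeat once per measured run
select count(*) from t12;   -- warm shared buffers
checkpoint;                 -- avoid a checkpoint landing mid-CTAS
\timing on
create table tt as select * from t12;
\timing off
drop table tt;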
Re: Two pg_rewind patches (auto generate recovery conf and ensure clean shutdown)
> Thank for rebasing.
>
> I didn't like 0001 very much.
>
> * It seems now would be the time to stop pretending we're using a file
> called recovery.conf; I know we still support older server versions that
> use that file, but it sounds like we should take the opportunity to
> rename the function to be less misleading once those versions vanish out
> of existence.

How about renaming the functions as follows?

GenerateRecoveryConf -> GenerateRecoveryConfContents
WriteRecoveryConf -> WriteRecoveryConfInfo  <- it writes standby.signal on pg12+, so the name WriteRecoveryConfContents would not be accurate

and the variable writerecoveryconf -> write_recovery_conf_info?

> * disconnect_and_exit seems a bit out of place compared to the other
> parts of this new module. I think you only put it there so that the
> 'conn' can be a global, and that you can stop passing 'conn' as a
> variable to GenerateRecoveryConf. It seems more modular to me to keep
> it as a separate variable in each program and have it passed down to the
> routine that writes the file.
>
> * From modularity also seems better to me to avoid a global variable
> 'recoveryconfcontents' and instead return the string from
> GenerateRecoveryConf to pass as a param to WriteRecoveryConf.
> (In fact, I wonder why the current structure is like it is, namely to
> have ReceiveAndUnpackTarFile write the file; why wouldn't be its caller
> be responsible for writing it?)

It is reasonable to have the common code include fewer variables. I can try modifying the patches to remove the previously added variables below from the common code.

+/* Contents of configuration file to be generated */
+extern PQExpBuffer recoveryconfcontents;
+
+extern bool writerecoveryconf;
+extern char *replication_slot;
+PGconn	   *conn;

> I wonder about putting this new file in src/fe_utils instead of keeping
> it in pg_basebackup and symlinking to pg_rewind. Maybe if we make it a
> true module (recovery_config_gen.c) it makes more sense there.

I thought a bit about where to put the common code too. It seems pg_rewind and pg_basebackup are the only consumers of this small piece of common code; I doubt it deserves a separate file under src/fe_utils.

> 0003:
>
> I still don't understand why we need a command-line option to do this.
> Why should it not be the default behavior?

This was discussed, but frankly speaking I do not know how other postgres users or enterprise providers handle this (probably some have their own scripts?). I could easily remove the option code if more people agree on that, or at least we could turn it on by default?

Thanks
Re: Two pg_rewind patches (auto generate recovery conf and ensure clean shutdown)
I've updated the patch series following the suggestions. Thanks.

v6-0001-Extact-common-functions-from-pg_basebackup-into-s.patch
Description: Binary data
v6-0003-Ensure-target-clean-shutdown-at-the-beginning-of-.patch
Description: Binary data
v6-0002-Add-option-to-write-recovery-configuration-inform.patch
Description: Binary data
Re: Two pg_rewind patches (auto generate recovery conf and ensure clean shutdown)
The patch series failed on Windows CI. I modified the Windows build file to fix that. See attached for the v7 version. v7-0003-Ensure-target-clean-shutdown-at-the-beginning-of-.patch Description: Binary data v7-0001-Extact-common-recovery-related-functions-from-pg_.patch Description: Binary data v7-0002-Add-option-to-write-recovery-configuration-inform.patch Description: Binary data
Re: Two pg_rewind patches (auto generate recovery conf and ensure clean shutdown)
> On 2019-Sep-20, Paul Guo wrote:
>
> > The patch series failed on Windows CI. I modified the Windows build file to
> > fix that. See attached for the v7 version.
>
> Thanks.
>
> Question about 0003. If we specify --skip-clean-shutdown and the cluster
> was not cleanly shut down, shouldn't we error out instead of trying to
> press on?

pg_rewind would error out in this case; see sanityChecks(). Users are expected to clean up themselves if they use this argument.
Re: Two pg_rewind patches (auto generate recovery conf and ensure clean shutdown)
On Thu, Sep 26, 2019 at 1:48 AM Alvaro Herrera wrote:
> CC Alexey for reasons that become clear below.
>
> On 2019-Sep-25, Paul Guo wrote:
> > > Question about 0003. If we specify --skip-clean-shutdown and the cluster
> > > was not cleanly shut down, shouldn't we error out instead of trying to
> > > press on?
> >
> > pg_rewind would error out in this case, see sanityChecks().
> > Users are expected to clean up themselves if they use this argument.
>
> Ah, good. We should have a comment about that below the relevant
> stanza, I suggest. (Or maybe in the same comment that ends in line
> 272).
>
> I pushed 0001 with a few tweaks. Nothing really substantial, just my
> OCD that doesn't leave me alone ... but this means your subsequent
> patches need to be adjusted. One thing is that that patch touched
> pg_rewind for no reason (those changes should have been in 0002) --
> dropped those.
>
> Another thing in 0002 is that you're adding a "-R" switch to pg_rewind,
> but we have another patch in the commitfest using the same switch for a
> different purpose. Maybe you guys need to get to an agreement over who
> uses the letter :-) Also, it would be super helpful if you review
> Alexey's patch: https://commitfest.postgresql.org/24/1849/
>
> This line is far too long:
> + printf(_(" -s, --skip-clean-shutdown skip running single-mode postgres if needed to make sure target is clean shutdown\n"));
>
> Can we make the description more concise?

Thanks. I've updated the remaining two patches and attached them as v8. Note that in the 2nd patch the long option is changed as below. Both the option and the description now seem more concise, given that we want the DB state to be either DB_SHUTDOWNED or DB_SHUTDOWNED_IN_RECOVERY.

"-s, --no-ensure-shutdowned do not auto-fix unclean shutdown"

v8-0001-Add-option-to-write-recovery-configuration-inform.patch
Description: Binary data
v8-0002-Ensure-target-clean-shutdown-at-the-beginning-of-.patch
Description: Binary data
Re: Two pg_rewind patches (auto generate recovery conf and ensure clean shutdown)
> Yes, -R is already used in pg_basebackup for the same functionality, so
> it seems natural to use it here as well for consistency.
>
> I will review options naming in my own patch and update it accordingly.
> Maybe -w/-W or -a/-A options will be good, since it's about WALs
> retrieval from archive.
>
> Thanks
>
> Regards
> --
> Alexey
>
> P.S. Just noticed that in v12 the full name of the -R option in pg_basebackup is
> still --write-recovery-conf, which is good for backward compatibility,
> but looks a little bit awkward, since recovery.conf doesn't exist
> already, doesn't it? However, one may read it as
> 'write-recovery-configuration', then it seems fine.

Yes, here is the description:

"--write-recovery-conf      write configuration for replication"

So it does not mention the file recovery.conf. People who do not know the recovery.conf history should not really be confused, since PostgreSQL has various configuration files.
Re: Two pg_rewind patches (auto generate recovery conf and ensure clean shutdown)
> > Note in the 2nd patch, the long option is changed as below. Both the option
> > and description
> > now seem more concise since we want the DB state as either DB_SHUTDOWNED
> > or
> > DB_SHUTDOWNED_IN_RECOVERY.
> >
> > "-s, --no-ensure-shutdowned do not auto-fix unclean shutdown"
>
> Note that "shutdowned" is incorrect English; we've let
> it live in the code because it's not user-visible, but we should
> certainly not immortalize it where it becomes so. I suppose
> "--no-ensure-shutdown" is okay, although I think some may prefer
> "--no-ensure-shut-down". Opinions from native speakers would be
> welcome. Also, let's expand "auto-fix" to "automatically fix" (or
> "repair" if there's room in the line? Not sure. Can be bikeshedded to
> death I guess.)

I chose that one from the three below:

--no-ensure-shutdown
--no-ensure-shutdowned
--no-ensure-clean-shutdown

Now I agree that for the user experience we should not use the 2nd one. --no-ensure-clean-shutdown and --no-ensure-shut-down seem to have too many hyphens, so I'm using --no-ensure-shutdown in the new version unless there are better suggestions.

> Secondarily, I see no reason to test connstr_source rather than just
> "conn" in the other patch; doing it the other way is more natural, since
> it's that thing that's tested as an argument.
>
> pg_rewind.c: Please put the new #include line keeping the alphabetical
> order.

Agreed on the above suggestions. Attached is v9. Thanks.

v9-0001-Add-option-to-write-recovery-configuration-inform.patch
Description: Binary data
v9-0002-Ensure-target-clean-shutdown-in-pg_rewind.patch
Description: Binary data
Re: Batch insert in CTAS/MatView code
Asim, thanks for the review.

On Wed, Sep 25, 2019 at 6:39 PM Asim R P wrote:
>
> On Mon, Sep 9, 2019 at 4:02 PM Paul Guo wrote:
> >
> > So in theory
> > we should not worry about additional tuple copy overhead now, and then I tried the patch without setting
> > multi-insert threshold as attached.
>
> I reviewed your patch today. It looks good overall. My concern is that
> the ExecFetchSlotHeapTuple call does not seem appropriate. In a generic
> place such as createas.c, we should be using generic tableam API only.
> However, I can also see that there is no better alternative. We need to
> compute the size of accumulated tuples so far, in order to decide whether
> to stop accumulating tuples. There is no convenient way to obtain the
> length of the tuple, given a slot. How about making that decision solely
> based on number of tuples, so that we can avoid ExecFetchSlotHeapTuple call
> altogether?

For heapam, ExecFetchSlotHeapTuple() will be called again in heap_multi_insert() to prepare the final multi-insert. Looking at ExecFetchSlotHeapTuple(), we can see that calling it multiple times involves very little overhead in the BufferHeapTuple case. Note that for the virtual tuple case the 2nd ExecFetchSlotHeapTuple() call still copies the slot contents, but we have already called ExecCopySlot(batchslot, slot) to copy into a BufferHeap slot, so there is no worry for the virtual tuple case (as a source). Previously (long ago) I probably understood the code incorrectly, so I had the same concern and used sampling (for variable-length tuples), but apparently we do not need that now.

> The multi insert copy code deals with index tuples also, which I don't see
> in the patch. Don't we need to consider populating indexes?

The CREATE TABLE AS / CREATE MATERIALIZED VIEW DDL does not involve index creation on the new table/matview. The code could probably be used for REFRESH MATERIALIZED VIEW as well; for that, we would need to consider indexes if we use multi-insert there.
Re: Batch insert in CTAS/MatView code
On Thu, Sep 26, 2019 at 9:43 PM Alvaro Herrera wrote:
> On 2019-Sep-25, Asim R P wrote:
>
> > I reviewed your patch today. It looks good overall. My concern is that
> > the ExecFetchSlotHeapTuple call does not seem appropriate. In a generic
> > place such as createas.c, we should be using generic tableam API only.
> > However, I can also see that there is no better alternative. We need to
> > compute the size of accumulated tuples so far, in order to decide whether
> > to stop accumulating tuples. There is no convenient way to obtain the
> > length of the tuple, given a slot. How about making that decision solely
> > based on number of tuples, so that we can avoid ExecFetchSlotHeapTuple call
> > altogether?
>
> ... maybe we should add a new operation to slots, that returns the
> (approximate?) size of a tuple? That would make this easy. (I'm not
> sure however what to do about TOAST considerations -- is it size in
> memory that we're worried about?)
>
> Also:
>
> + myState->mi_slots_size >= 65535)
>
> This magic number should be placed in a define next to the other one,
> but I'm not sure that heapam.h is a good location, since surely this
> applies to matviews in other table AMs too.

Yes, a define for 65535 seems better. Let's fix this one later once we have more feedback. Thanks.
Re: Batch insert in CTAS/MatView code
On Sat, Sep 28, 2019 at 5:49 AM Andres Freund wrote:
> Hi,
>
> On 2019-09-09 18:31:54 +0800, Paul Guo wrote:
> > diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
> > index e9544822bf..8a844b3b5f 100644
> > --- a/src/backend/access/heap/heapam.c
> > +++ b/src/backend/access/heap/heapam.c
> > @@ -2106,7 +2106,6 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
> >  			CommandId cid, int options, BulkInsertState bistate)
> >  {
> >  	TransactionId xid = GetCurrentTransactionId();
> > -	HeapTuple  *heaptuples;
> >  	int			i;
> >  	int			ndone;
> >  	PGAlignedBlock scratch;
> > @@ -2115,6 +2114,10 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
> >  	Size		saveFreeSpace;
> >  	bool		need_tuple_data = RelationIsLogicallyLogged(relation);
> >  	bool		need_cids = RelationIsAccessibleInLogicalDecoding(relation);
> > +	/* Declare it as static to let this memory be not on stack. */
> > +	static HeapTuple	heaptuples[MAX_MULTI_INSERT_TUPLES];
> > +
> > +	Assert(ntuples <= MAX_MULTI_INSERT_TUPLES);
> >
> >  	/* currently not needed (thus unsupported) for heap_multi_insert() */
> >  	AssertArg(!(options & HEAP_INSERT_NO_LOGICAL));
> > @@ -2124,7 +2127,6 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
> >  			HEAP_DEFAULT_FILLFACTOR);
> >
> >  	/* Toast and set header data in all the slots */
> > -	heaptuples = palloc(ntuples * sizeof(HeapTuple));
> >  	for (i = 0; i < ntuples; i++)
> >  	{
> >  		HeapTuple	tuple;
>
> I don't think this is a good idea. We shouldn't unnecessarily allocate
> 8KB on the stack. Is there any actual evidence this is a performance
> benefit? To me this just seems like it'll reduce the flexibility of the
> API, without any benefit.

Previously heaptuples was palloc-ed for each batch, which in theory should be slower than pre-allocating and reusing the memory.

> I'll also note that you've apparently not
> updated tableam.h to document this new restriction.

Yes, it should be moved from heapam.h to that file, along with the 65535 definition.
Re: Batch insert in CTAS/MatView code
> > > However, I can also see that there is no better alternative. We need to
> > > compute the size of accumulated tuples so far, in order to decide whether
> > > to stop accumulating tuples. There is no convenient way to obtain the
> > > length of the tuple, given a slot. How about making that decision solely
> > > based on number of tuples, so that we can avoid ExecFetchSlotHeapTuple call
> > > altogether?
> >
> > ... maybe we should add a new operation to slots, that returns the
> > (approximate?) size of a tuple?
>
> Hm, I'm not convinced that it's worth adding that as a dedicated
> operation. It's not that clear what it'd exactly mean anyway - what
> would it measure? As referenced in the slot? As if it were stored on
> disk? etc?
>
> I wonder if the right answer wouldn't be to just measure the size of a
> memory context containing the batch slots, or something like that.

Probably a better way is to move that logic (appending a slot to the slot array, deciding when to flush, flushing, cleaning up the slots) into table_multi_insert()? Generally, the final implementation of table_multi_insert() should be able to know the sizes easily; see the sketch below. One concern is that currently only COPY in the repo uses multi-insert, so I'm not sure whether other callers in the future would want their own logic (or we could set up a flag to allow customization, but that seems a bit over-designed?).
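To illustrate the encapsulation idea, here is a hypothetical shape for such an API; none of these names exist in the tree, and the table AM would own the buffering and flush policy internally:

/* Hypothetical tableam-level batching API; names invented for illustration. */
typedef struct TableInsertBatch TableInsertBatch;	/* opaque, AM-private */

/* Set up AM-private batch state (slot array, size accounting, etc). */
extern TableInsertBatch *table_multi_insert_begin(Relation rel,
												  CommandId cid,
												  int options);

/*
 * Hand one tuple to the AM.  The AM copies it into its own slot, tracks the
 * accumulated size itself, and flushes internally whenever it decides the
 * batch is large enough - so callers never need ExecFetchSlotHeapTuple().
 */
extern void table_multi_insert_add(TableInsertBatch *batch,
								   TupleTableSlot *slot);

/* Flush any remaining tuples and release the batch state. */
extern void table_multi_insert_end(TableInsertBatch *batch);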
Re: Two pg_rewind patches (auto generate recovery conf and ensure clean shutdown)
> I went through the remaining two patches and they seem to be very clear
> and concise. However, there are two points I could complain about:
>
> 1) Maybe I've missed it somewhere in the thread above, but currently
> pg_rewind allows to run itself with -R and --source-pgdata. In that case
> -R option is just swallowed and neither standby.signal, nor
> postgresql.auto.conf is written, which is reasonable though. Should it
> be stated somehow in the docs that -R option always has to go altogether
> with --source-server? Or should pg_rewind notify user that options are
> incompatible and no recovery configuration will be written?

I modified the code & docs to address this. In the code, pg_rewind now errors out for the local case.

> 2) Are you going to leave -R option completely without tap-tests?
> Attached is a small patch, which tests -R option along with the existing
> 'remote' case. If needed it may be split into two separate cases. First,
> it tests that pg_rewind is able to succeed with minimal permissions
> according to the Michael's patch d9f543e [1]. Next, it checks presence
> of standby.signal and adds REPLICATION permission to rewind_user to test
> that new standby is able to start with generated recovery configuration.
>
> [1] https://github.com/postgres/postgres/commit/d9f543e9e9be15f92abdeaf870e57ef289020191

It seems that we could further disable the recovery-configuration setup code for the 'remote' test case, like this?

-	my $port_standby = $node_standby->port;
-	$node_master->append_conf(
-		'postgresql.conf', qq(
-primary_conninfo='port=$port_standby'
-));
+	if ($test_mode ne "remote")
+	{
+		my $port_standby = $node_standby->port;
+		$node_master->append_conf(
+			'postgresql.conf',
+			qq(primary_conninfo='port=$port_standby'));

-	$node_master->set_standby_mode();
+		$node_master->set_standby_mode();
+	}

Thanks.

v10-0001-Add-option-to-write-recovery-configuration-infor.patch Description: Binary data
Re: Two pg_rewind patches (auto generate recovery conf and ensure clean shutdown)
> > BTW in the future if you have two separate patches, please post them in > separate threads and use separate commitfest items for each, even if > they have minor conflicts. > Sure. Thanks.
Batch insert in CTAS/MatView code
Hello, Postgres hackers,

The copy code uses batch insert, via the function heap_multi_insert(), to speed things up. It seems that CREATE TABLE AS and materialized views could leverage that code to boost performance as well. Attached is a patch to implement that. It was done by Taylor (cc-ed) and me. The patch also modifies heap_multi_insert() a bit for a further code-level optimization: using static memory instead of a memory context and dynamic allocation.

For ModifyTable->insert, there seem to be more restrictions on batch insert (triggers, etc.?), but it seems possible to batch insert in the cases where it is allowed?

By the way, while looking at the code, I noticed that there are 9 local arrays with large lengths in toast_insert_or_update(), which seems to be a risk of stack overflow. Maybe we should make them static or global.

Here is a quick, simple performance test on a mirrorless Postgres instance with the SQL below. The tests cover tables with small column lengths, large column lengths, and toast.

-- tuples with small size.
drop table if exists t1;
create table t1 (a int);
insert into t1 select * from generate_series(1, 1000);
drop table if exists t2;
\timing
create table t2 as select * from t1;
\timing

-- tuples that are untoasted and data that is 1664 bytes wide
drop table if exists t1;
create table t1 (a name, b name, c name, d name, e name, f name, g name, h name, i name, j name, k name, l name, m name, n name, o name, p name, q name, r name, s name, t name, u name, v name, w name, x name, y name, z name);
insert into t1 select 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z' from generate_series(1, 50);
drop table if exists t2;
\timing
create table t2 as select * from t1;
\timing

-- tuples that are toastable.
drop table if exists t1;
create table t1 (a text, b text, c text, d text, e text, f text, g text, h text, i text, j text, k text, l text, m text, n text, o text, p text, q text, r text, s text, t text, u text, v text, w text, x text, y text, z text);
insert into t1 select i, i, i, i, i, i, i, i, i, i, i, i, i, i, i, i, i, i, i, i, i, i, i, i, i, i from (select repeat('123456789', 1) from generate_series(1,2000)) i;
drop table if exists t2;
\timing
create table t2 as select * from t1;
\timing

Here are the timing results:

With the patch,
Time: 4728.142 ms (00:04.728)
Time: 14203.983 ms (00:14.204)
Time: 1008.669 ms (00:01.009)

Baseline,
Time: 11096.146 ms (00:11.096)
Time: 13106.741 ms (00:13.107)
Time: 1100.174 ms (00:01.100)

For toast and large column sizes there is a < 10% decrease, but for small column sizes the improvement is very good. Actually, if I hardcode the batch count as 4, all test cases are better, but the improvement for small column sizes is smaller than with the current patch. The number 4 is quite case-specific, so I cannot hardcode it in the patch. Of course we could tune it further, but the current value seems to be a good trade-off?

Thanks.

0001-Heap-batch-insert-for-CTAS.patch Description: Binary data
Re: Batch insert in CTAS/MatView code
Sorry for the late reply.

To Michael: thank you. I know this commitfest is ongoing and I'm not targeting it.

On Thu, Mar 7, 2019 at 4:54 PM Heikki Linnakangas wrote:
> On 06/03/2019 22:06, Paul Guo wrote:
> > The patch also modifies heap_multi_insert() a bit to do a bit further
> > code-level optimization by using static memory, instead of using memory
> > context and dynamic allocation.
>
> If toasting is required, heap_prepare_insert() creates a palloc'd tuple.
> That is still leaked to the current memory context.

Thanks. I checked the code for that but apparently missed that one. I'll see what the proper context is for CTAS. For the copy code, maybe just revert my change.

> Leaking into the current memory context is not a bad thing, because
> resetting a memory context is faster than doing a lot of pfree() calls.
> The callers just need to be prepared for that, and use a short-lived
> memory context.
>
> > By the way, while looking at the code, I noticed that there are 9 local
> > arrays with large length in toast_insert_or_update() which seems to be a
> > risk of stack overflow. Maybe we should put it as static or global.
>
> Hmm. We currently reserve 512 kB between the kernel's limit, and the
> limit we check in check_stack_depth(). See STACK_DEPTH_SLOP. Those
> arrays add up to 52800 bytes on a 64-bit machine, if I did my math
> right. So there's still a lot of headroom. I agree that it nevertheless
> seems a bit excessive, though.

I was worried about some recursive calling of it, but there is probably no such worry for toast_insert_or_update().

> > With the patch,
> >
> > Time: 4728.142 ms (00:04.728)
> > Time: 14203.983 ms (00:14.204)
> > Time: 1008.669 ms (00:01.009)
> >
> > Baseline,
> > Time: 11096.146 ms (00:11.096)
> > Time: 13106.741 ms (00:13.107)
> > Time: 1100.174 ms (00:01.100)
>
> Nice speedup!
>
> > While for toast and large column size there is < 10% decrease but for
> > small column size the improvement is super good. Actually if I hardcode
> > the batch count as 4 all test cases are better but the improvement for
> > small column size is smaller than that with current patch. Pretty much
> > the number 4 is quite case specific so I can not hardcode that in the
> > patch. Of course we could further tune that but the current value seems
> > to be a good trade-off?
>
> Have you done any profiling, on why the multi-insert is slower with
> large tuples? In principle, I don't see why it should be slower.

Thanks for the suggestion. I'll explore this a bit more.

> - Heikki
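For what it's worth, a minimal sketch of the short-lived-context approach Heikki describes, assuming a dedicated per-batch context in the CTAS receiver (names are illustrative):

/* Sketch only: reset a dedicated context once per flushed batch instead
 * of pfree'ing each toasted tuple copy individually. */
MemoryContext batchcxt;
MemoryContext oldcxt;

batchcxt = AllocSetContextCreate(CurrentMemoryContext,
								 "CTAS batch insert",
								 ALLOCSET_DEFAULT_SIZES);

oldcxt = MemoryContextSwitchTo(batchcxt);
heap_multi_insert(rel, slots, nslots, mycid, ti_options, bistate);
MemoryContextSwitchTo(oldcxt);
MemoryContextReset(batchcxt);	/* reclaims any palloc'd toast copies */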
Re: Batch insert in CTAS/MatView code
On Mon, Mar 11, 2019 at 2:58 AM David Fetter wrote:
> On Wed, Mar 06, 2019 at 10:06:27PM +0800, Paul Guo wrote:
> > Hello, Postgres hackers,
> >
> > The copy code has used batch insert with function heap_multi_insert() to
> > speed up. It seems that Create Table As or Materialized View could leverage
> > that code also to boost the performance also. Attached is a patch to
> > implement that.
>
> This is great!
>
> Is this optimization doable for multi-row INSERTs, either with tuples
> spelled out in the body of the query or in constructs like INSERT ...
> SELECT ...?

Yes. That's the "batch insert" in the ModifyTable nodes which I mentioned in the first email. By the way, batching is a common optimization for iteration-style execution models (like the Postgres executor), so batching should, in theory, benefit many executor nodes.

> Best,
> David.
> --
> David Fetter http://fetter.org/
> Phone: +1 415 235 3778
>
> Remember to vote!
> Consider donating to Postgres: http://www.postgresql.org/about/donate
Two pg_rewind patches (auto generate recovery conf and ensure clean shutdown)
Hello, Postgres hackers,

Please see the attached patches.

The first patch adds an option to automatically generate recovery conf contents in the related files, following pg_basebackup. In the patch, GenerateRecoveryConf(), WriteRecoveryConf() and escape_quotes() are almost the same as those in pg_basebackup. The main difference is due to replication slot support and code (variable) limits. It seems that we could refactor slightly later to put some common code into another file, after aligning pg_rewind with pg_basebackup. This was tested manually and was done by Jimmy (cc-ed), Ashwin (cc-ed) and me.

Another patch does an automatic clean shutdown by running a single-user-mode postgres instance if the target was not cleanly shut down, since that is required by pg_rewind. This was manually tested and was done by Jimmy (cc-ed) and me. I'm not sure if we want a test case for that though.

Thanks.

0001-Ensure-target-clean-shutdown-at-beginning-of-pg_rewi.patch Description: Binary data
0001-Auto-generate-recovery-conf-at-the-end-of-pg_re.patch Description: Binary data
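For reference, a rough sketch of how the clean-shutdown step can be driven from pg_rewind; the variable names and exact flags here are illustrative, not necessarily what the attached patch uses:

/* Sketch: run the target once in single-user mode so it performs crash
 * recovery and then shuts down cleanly, leaving a rewindable data dir. */
char		cmd[MAXPGPATH * 2];

snprintf(cmd, sizeof(cmd),
		 "\"%s\" --single -F -D \"%s\" template1 < \"%s\"",
		 postgres_path,		/* hypothetical: path to the postgres binary */
		 datadir_target,
		 DEVNULL);
if (system(cmd) != 0)
	pg_fatal("single-user postgres failed to bring the target to a clean shutdown\n");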
Re: Two pg_rewind patches (auto generate recovery conf and ensure clean shutdown)
On Tue, Mar 19, 2019 at 2:18 PM Michael Paquier wrote:
> On Tue, Mar 19, 2019 at 02:09:03PM +0800, Paul Guo wrote:
> > The first patch adds an option to automatically generate recovery conf
> > contents in related files, following pg_basebackup. In the patch,
> > GenerateRecoveryConf(), WriteRecoveryConf() and escape_quotes() are almost
> > same as them on pg_basebackup. The main difference is due to replication
> > slot support and code (variables) limit. It seems that we could slightly
> > refactor later to put some common code into another file after aligning
> > pg_rewind with pg_basebackup. This was tested manually and was done by
> > Jimmy (cc-ed), Ashwin (cc-ed) and me.
>
> Interesting. The two routines have really the same logic, I would
> recommend to have a first patch which does the refactoring and have
> pg_rewind use it, and then a second patch which writes recovery.conf
> and uses the first patch to get the contents. Please note that the
> common routine needs to be version-aware as pg_basebackup requires
> compatibility with past versions, but you could just pass the version
> number from the connection, and have pg_rewind pass the compiled-in
> version value.

This is a good suggestion also. Will do it.

> > Another patch does automatic clean shutdown by running a single mode
> > postgres instance if the target was not clean shut down since that is
> > required by pg_rewind. This was manually tested and was done by Jimmy
> > (cc-ed) and me. I'm not sure if we want a test case for that though.
>
> I am not sure that I see the value in that. I'd rather let the
> required service start and stop out of pg_rewind and not introduce
> dependencies with other binaries. This step can also take quite some
> time depending on the amount of WAL to replay post-crash at recovery
> and the shutdown checkpoint which is required to reach a consistent
> on-disk state.

This makes the recovery more automatic. Yes, it adds a dependency on the postgres binary, but most of the time pg_rewind is shipped alongside postgres in the same install directory. From my experience of manually testing pg_rewind, I feel that this, together with the auto-written recovery conf, really eases the burden. I'm not sure how other users usually proceed before running pg_rewind when the target was not cleanly shut down, but perhaps we can add an argument to pg_rewind to give people who want to handle the target separately another choice? Default on or off, whatever.

The time is still required for people who want to make the target ready for pg_rewind in another way.

Thanks.
Re: Two pg_rewind patches (auto generate recovery conf and ensure clean shutdown)
On Wed, Mar 20, 2019 at 1:20 PM Michael Paquier wrote:
> On Wed, Mar 20, 2019 at 12:48:52PM +0800, Paul Guo wrote:
> > This is a good suggestion also. Will do it.
>
> Please note also that we don't care about recovery.conf since v12 as
> recovery parameters are now GUCs. I would suggest appending those
> extra parameters to postgresql.auto.conf, which is what pg_basebackup
> does.

Yes, the recovery conf patch in the first email already does it like this, i.e. it writes postgresql.auto.conf & standby.signal.
Re: standby recovery fails (tablespace related) (tentative patch and discussion)
Thanks. I updated the patch to v5. It passes install-check testing and recovery testing.

On Fri, Aug 2, 2019 at 6:38 AM Thomas Munro wrote:
> On Mon, Jul 15, 2019 at 10:52 PM Paul Guo wrote:
> > Please see the attached v4 patch.
>
> While moving this to the next CF, I noticed that this needs updating
> for the new pg_list.h API.
>
> --
> Thomas Munro
> https://enterprisedb.com

v5-0001-skip-copydir-if-either-src-directory-or-dst-direc.patch Description: Binary data
Questions about SyncRepWaitForLSN()
Hello pg hackers,

This is the definition of the function:

SyncRepWaitForLSN(XLogRecPtr lsn, bool commit)

1. In the code, it emits ereport(WARNING) for the ProcDiePending/QueryCancelPending case like this:

ereport(WARNING,
		(errcode(ERRCODE_ADMIN_SHUTDOWN),
		 errmsg("canceling the wait for synchronous replication and terminating connection due to administrator command"),
		 errdetail("The transaction has already committed locally, but might not have been replicated to the standby.")));

The message "The transaction has already committed locally" is wrong for non-commit waiting, e.g. 2PC Prepare or AbortPrepare, right? So maybe we should emit the errdetail only for the commit==true case (sketch below).

2. I'm curious how the client should proceed in the ProcDiePending corner case in this function (assuming synchronous_commit is remote_write or above). In this scenario, a transaction has been committed locally on the master, but if ProcDiePending happens, we do not know whether the commit was replicated to the standby. The commit is then not in a safe state from the perspective of HA: for example, if auto-failover happens later, we may or may not lose the transaction commit on the standby, and the client only gets (and may not even get) a warning about the unknown replication status of the commit.

Thanks.
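A minimal sketch of point 1, making the errdetail conditional on the commit flag (illustrative wording, not a finished patch):

/* Sketch: only claim "committed locally" when we were actually waiting
 * for a commit record, not for a 2PC prepare/abort-prepare. */
ereport(WARNING,
		(errcode(ERRCODE_ADMIN_SHUTDOWN),
		 errmsg("canceling the wait for synchronous replication and terminating connection due to administrator command"),
		 commit ?
		 errdetail("The transaction has already committed locally, but might not have been replicated to the standby.") : 0));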
Re: standby recovery fails (tablespace related) (tentative patch and discussion)
On Fri, Jan 10, 2020 at 9:43 PM Alvaro Herrera wrote:
> On 2020-Jan-09, Alvaro Herrera wrote:
>
> > I looked at this a little while and was bothered by the perl changes; it
> > seems out of place to have RecursiveCopy be thinking about tablespaces,
> > which is way out of its league. So I rewrote that to use a callback:
> > the PostgresNode code passes a callback that's in charge to handle the
> > case of a symlink. Things look much more in place with that. I didn't
> > verify that all places that should use this are filled.
> >
> > In 0002 I found adding a new function unnecessary: we can keep backwards
> > compat by checking 'ref' of the third argument. With that we don't have
> > to add a new function. (POD changes pending.)
>
> I forgot to add that something in these changes is broken (probably the
> symlink handling callback) so the tests fail, but I couldn't stay away
> from my daughter's birthday long enough to figure out what or how. I'm
> on something else today, so if one of you can research and submit fixed
> versions, that'd be great.
>
> Thanks,

I spent some time on this before getting off work today. With the fix below, the 4th test is now OK, but the 5th (last) one hangs due to a panic:

(gdb) bt
#0 0x003397e32625 in raise () from /lib64/libc.so.6
#1 0x003397e33e05 in abort () from /lib64/libc.so.6
#2 0x00a90506 in errfinish (dummy=0) at elog.c:590
#3 0x00a92b4b in elog_finish (elevel=22, fmt=0xb2d580 "cannot find directory %s tablespace %d database %d") at elog.c:1465
#4 0x0057aa0a in XLogLogMissingDir (spcNode=16384, dbNode=0, path=0x1885100 "pg_tblspc/16384/PG_13_202001091/16389") at xlogutils.c:104
#5 0x0065e92e in dbase_redo (record=0x1841568) at dbcommands.c:2225
#6 0x0056ac94 in StartupXLOG () at xlog.c:7200

diff --git a/src/include/commands/dbcommands.h b/src/include/commands/dbcommands.h
index b71b400e700..f8f6d5ffd03 100644
--- a/src/include/commands/dbcommands.h
+++ b/src/include/commands/dbcommands.h
@@ -19,8 +19,6 @@
 #include "lib/stringinfo.h"
 #include "nodes/parsenodes.h"

-extern void CheckMissingDirs4DbaseRedo(void);
-
 extern Oid createdb(ParseState *pstate, const CreatedbStmt *stmt);
 extern void dropdb(const char *dbname, bool missing_ok, bool force);
 extern void DropDatabase(ParseState *pstate, DropdbStmt *stmt);

diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index e6e7ea505d9..4eef8bb1985 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -615,11 +615,11 @@ sub _srcsymlink
 	my $srcrealdir = readlink($srcpath);

 	opendir(my $dh, $srcrealdir);
-	while (readdir $dh)
+	while (my $entry = (readdir $dh))
 	{
-		next if (/^\.\.?$/);
-		my $spath = "$srcrealdir/$_";
-		my $dpath = "$dstrealdir/$_";
+		next if ($entry eq '.' or $entry eq '..');
+		my $spath = "$srcrealdir/$entry";
+		my $dpath = "$dstrealdir/$entry";
 		RecursiveCopy::copypath($spath, $dpath);
 	}
 	closedir $dh;
Re: standby recovery fails (tablespace related) (tentative patch and discussion)
I further fixed the last test failure (due to a small bug in the test, not in code). Attached are the new patch series. Let's see the CI pipeline result. v9-0001-Support-node-initialization-from-backup-with-tabl.patch Description: Binary data v9-0003-Fix-replay-of-create-database-records-on-standby.patch Description: Binary data v9-0002-Tests-to-replay-create-database-operation-on-stan.patch Description: Binary data
Re: Batch insert in CTAS/MatView code
I took some time digging into the issue yesterday. The main concern with the patch is that it gets the tuple length from ExecFetchSlotHeapTuple():

+	ExecCopySlot(batchslot, slot);
+	tuple = ExecFetchSlotHeapTuple(batchslot, true, NULL);
+
+	myState->mi_slots_num++;
+	myState->mi_slots_size += tuple->t_len;

We definitely should remove ExecFetchSlotHeapTuple() here, but we still need to know the tuple length (at least a rough length). One solution might be using the memory context stat info, as mentioned; the code looks like this:

	tup_len = MemoryContextUsedSize(batchslot->tts_mcxt);
	ExecCopySlot(batchslot, slot);
	ExecMaterializeSlot(batchslot);
	tup_len = MemoryContextUsedSize(batchslot->tts_mcxt) - tup_len;

MemoryContextUsedSize() is added to calculate the total used size (simply a hack on top of the stats interface):

+int
+MemoryContextUsedSize(MemoryContext context)
+{
+	MemoryContextCounters total;
+
+	memset(&total, 0, sizeof(total));
+	context->methods->stats(context, NULL, NULL, &total);
+
+	return total.totalspace - total.freespace;
+}

This basically works, but there are concerns: the length is not accurate (though we do not need it to be that accurate) since there are some additional memory allocations; we cannot be sure the measured size will not be much larger than the real length for some slot types in the future; and I'm not sure we can rely on the memory context holding at least the tuple length after materialization for all slot types in the future. Lastly, the code seems a bit ugly. As a reference, for "create table t1 as select * from t2", the above code returns a "tuple length" of 88 (the real tuple length is 4).

Another solution is to return the real size from ExecMaterializeSlot(), e.g.

	ExecMaterializeSlot(slot, &tup_len);

For this we would probably want to store the length in the slot struct for performance. In the COPY case the tuple length is known in advance, but I can imagine more cases that do not know the size yet need it for the multi-insert interface; at least I wonder whether we should use it in REFRESH MATERIALIZED VIEW and in a fast path for the Insert node (I have heard complaints about the performance of "insert into tbl from select..." from some of our users). So the concern is not just for the case in this patch.

Besides, my colleagues Ashwin Agrawal and Adam Lee suggested that maybe we could try raw_heap_insert()-style code for CTAS and compare, since CTAS inserts into a newly created table. That would involve more discussion, much more code change, and more testing (stability and performance). So multi-insert seems to be a stable solution in the short term, given that it has been used in COPY for a long time? Whatever the solution for CTAS, we need to address the concern about tuple size for the multi-insert cases.
standby recovery fails (tablespace related) (tentative patch and discussion)
Hello postgres hackers,

Recently my colleagues and I encountered an issue: a standby cannot recover after an unclean shutdown, and it is related to tablespaces. The issue is that the standby re-replays some xlog records that need tablespace directories (e.g. creating a database in a tablespace), but those tablespace directories were already removed during the previous replay. In detail: the standby normally finishes replaying the operations below, but due to the unclean shutdown, the redo LSN in pg_control is not updated and still points before the 'create db with tablespace' xlog record; since the tablespace directories were removed, replaying the database-create WAL reports an error.

create db with tablespace
drop database
drop tablespace

Here is the log on the standby:

2019-04-17 14:52:14.926 CST [23029] LOG: starting PostgreSQL 12devel on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-4), 64-bit
2019-04-17 14:52:14.927 CST [23029] LOG: listening on IPv4 address "192.168.35.130", port 5432
2019-04-17 14:52:14.929 CST [23029] LOG: listening on Unix socket "/tmp/.s.PGSQL.5432"
2019-04-17 14:52:14.943 CST [23030] LOG: database system was interrupted while in recovery at log time 2019-04-17 14:48:27 CST
2019-04-17 14:52:14.943 CST [23030] HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
2019-04-17 14:52:14.949 CST [23030] LOG: entering standby mode
2019-04-17 14:52:14.950 CST [23030] LOG: redo starts at 0/30105B8
2019-04-17 14:52:14.951 CST [23030] FATAL: could not create directory "pg_tblspc/65546/PG_12_201904072/65547": No such file or directory
2019-04-17 14:52:14.951 CST [23030] CONTEXT: WAL redo at 0/3011650 for Database/CREATE: copy dir 1663/1 to 65546/65547
2019-04-17 14:52:14.951 CST [23029] LOG: startup process (PID 23030) exited with exit code 1
2019-04-17 14:52:14.951 CST [23029] LOG: terminating any other active server processes
2019-04-17 14:52:14.953 CST [23029] LOG: database system is shut down

Steps to reproduce:

1. Set up a master and standby.
2. On both sides, run: mkdir /tmp/some_isolation2_pg_basebackup_tablespace
3. Run these SQLs:
   drop tablespace if exists some_isolation2_pg_basebackup_tablespace;
   create tablespace some_isolation2_pg_basebackup_tablespace location '/tmp/some_isolation2_pg_basebackup_tablespace';
4. Cleanly shut down and restart both postgres instances.
5. Run the following SQLs:
   drop database if exists some_database_with_tablespace;
   create database some_database_with_tablespace tablespace some_isolation2_pg_basebackup_tablespace;
   drop database some_database_with_tablespace;
   drop tablespace some_isolation2_pg_basebackup_tablespace;
   \! pkill -9 postgres; ssh host70 pkill -9 postgres

Note an immediate shutdown via pg_ctl should also be able to reproduce this, and the above steps probably do not reproduce it 100% of the time.

I created an initial patch for this issue (see the attachment). The idea is to re-create the missing directories recursively. The issue above exists in dbase_redo(), but TablespaceCreateDbspace() (for relation-file-create redo) is probably buggy too, so I modified that function as well. Even if there were no bug in that function, using a simple pg_mkdir_p() seems cleaner.

Note, reading TablespaceCreateDbspace(), it seems this issue was already considered, though insufficiently. Frankly, this solution (directory re-creation) is not perfect, given that this should really have been the responsibility of tablespace creation (and tablespace creation does more, like symlink creation, etc.). Also, I'm not sure whether we need to use the invalid-page mechanism (see xlogutils.c). Another solution: we actually create a checkpoint on createdb/movedb/dropdb/droptablespace, so maybe we should force the standby to create a restartpoint for this special kind of checkpoint WAL record; that means setting a flag in the checkpoint WAL record and letting the checkpoint redo code create a restartpoint if the flag is set. This solution seems safer.

Thanks,
Paul

0001-Recursively-create-tablespace-directories-if-those-a.patch Description: Binary data
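A minimal sketch of the re-creation idea in the attached patch, assuming we simply fall back to pg_mkdir_p() for the destination path in dbase_redo() (the error wording is illustrative):

/* Sketch: tolerate a parent directory removed by a later-replayed DROP
 * TABLESPACE by re-creating the whole path. Note pg_mkdir_p() may
 * modify its argument, so dst_path must be a writable buffer. */
if (pg_mkdir_p(dst_path, pg_dir_create_mode) != 0)
	ereport(ERROR,
			(errcode_for_file_access(),
			 errmsg("could not create directory \"%s\": %m", dst_path)));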
Re: Two pg_rewind patches (auto generate recovery conf and ensure clean shutdown)
Hi Michael,

I updated the patches as attached, following the previous discussions. The two patches:

v2-0001-Extact-common-functions-from-pg_basebackup-into-s.patch
v2-0002-Add-option-to-write-recovery-configuration-inform.patch

1. 0001 moves those common functions & variables into two new files (one .c and one .h) for both pg_rewind and pg_basebackup to use; note the functions are slightly modified (e.g. because conn may be NULL in pg_rewind). I am not sure where the proper place for the new files is. Currently they are under pg_basebackup and are used in pg_rewind (the Makefile is modified to support that).

2. 0002 adds the option to write the recovery conf.

The patch below runs a single-user-mode postgres instance if needed, to make sure the target is cleanly shut down. A new option is added (off by default).

v2-0001-Ensure-target-clean-shutdown-at-beginning-of-pg_r.patch

I've manually tested them, and installcheck passes.

Thanks.

On Wed, Mar 20, 2019 at 1:23 PM Paul Guo wrote:
> On Wed, Mar 20, 2019 at 1:20 PM Michael Paquier wrote:
>> On Wed, Mar 20, 2019 at 12:48:52PM +0800, Paul Guo wrote:
>> > This is a good suggestion also. Will do it.
>>
>> Please note also that we don't care about recovery.conf since v12 as
>> recovery parameters are now GUCs. I would suggest appending those
>> extra parameters to postgresql.auto.conf, which is what pg_basebackup
>> does.
>
> Yes, the recovery conf patch in the first email did like this, i.e.
> writing postgresql.auto.conf & standby.signal

v2-0001-Ensure-target-clean-shutdown-at-beginning-of-pg_r.patch Description: Binary data
v2-0001-Extact-common-functions-from-pg_basebackup-into-s.patch Description: Binary data
v2-0002-Add-option-to-write-recovery-configuration-inform.patch Description: Binary data
Re: standby recovery fails (tablespace related) (tentative patch and discussion)
Please see my replies inline. Thanks.

On Fri, Apr 19, 2019 at 12:38 PM Asim R P wrote:
> On Wed, Apr 17, 2019 at 1:27 PM Paul Guo wrote:
> >
> > create db with tablespace
> > drop database
> > drop tablespace.
>
> Essentially, that sequence of operations causes crash recovery to fail
> if the "drop tablespace" transaction was committed before crashing.
> This is a bug in crash recovery in general and should be reproducible
> without configuring a standby. Is that right?

No. In general, a checkpoint is done for drop_db/create_db/drop_tablespace on the master. That makes the files/directories up to date, if I understand the related code correctly. For a standby, checkpoint redo does not ensure that.

> Your patch creates missing directories in the destination. Don't we
> need to create the tablespace symlink under pg_tblspc/?

The 'create db with tablespace' redo log does not include the real tablespace directory information. Yes, we could add it into the xlog, but that seems like overdesign.

> I would prefer extending the invalid page mechanism to deal with this, as
> suggested by Ashwin off-list. It will allow us to avoid creating
> directories and files only to remove them shortly afterwards when the
> drop database and drop tablespace records are replayed.

The 'invalid page' mechanism seems more appropriate for missing pages of a file. For missing directories, we could of course hack it (e.g. by reading some page of a relfile in that database) to make the tablespace-create code (without the symlink) safer, on the assumption that those directories will be deleted soon.

More feedback about all of the previously discussed solutions is welcome.
Re: standby recovery fails (tablespace related) (tentative patch and discussion)
Hi Kyotaro, ignoring the MakePGDirectory() failure will fix this database-create redo error, but I suspect some other kind of redo, one that depends on files under the directory (they are not copied since the directory is not created) and that is also not covered by the invalid-page mechanism, could still fail. Thanks.

On Mon, Apr 22, 2019 at 3:40 PM Kyotaro HORIGUCHI <horiguchi.kyot...@lab.ntt.co.jp> wrote:
> Oops! The comment in the previous patch is wrong.
>
> At Mon, 22 Apr 2019 16:15:13 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI wrote in <20190422.161513.258021727.horiguchi.kyot...@lab.ntt.co.jp>
> > At Mon, 22 Apr 2019 12:36:43 +0800, Paul Guo wrote in
> > > Please see my replies inline. Thanks.
> > >
> > > On Fri, Apr 19, 2019 at 12:38 PM Asim R P wrote:
> > > > On Wed, Apr 17, 2019 at 1:27 PM Paul Guo wrote:
> > > > >
> > > > > create db with tablespace
> > > > > drop database
> > > > > drop tablespace.
> > > >
> > > > Essentially, that sequence of operations causes crash recovery to fail
> > > > if the "drop tablespace" transaction was committed before crashing.
> > > > This is a bug in crash recovery in general and should be reproducible
> > > > without configuring a standby. Is that right?
> > >
> > > No. In general, checkpoint is done for drop_db/create_db/drop_tablespace on
> > > master. That makes the file/directory update-to-date if I understand the
> > > related code correctly. For standby, checkpoint redo does not ensure that.
> >
> > That's right partly. As you must have seen, fast shutdown forces
> > restartpoint for the last checkpoint and it prevents the problem
> > from happening. Anyway it seems to be a problem.
> >
> > > > Your patch creates missing directories in the destination. Don't we
> > > > need to create the tablespace symlink under pg_tblspc/? I would
> > >
> > > 'create db with tablespace' redo log does not include the tablespace real
> > > directory information. Yes, we could add in it into the xlog, but that
> > > seems to be an overdesign.
> >
> > But I don't think creating directory that is to be removed just
> > after is a wanted solution. The directory most likely to be be
> > removed just after.
> >
> > > > prefer extending the invalid page mechanism to deal with this, as
> > > > suggested by Ashwin off-list. It will allow us to avoid creating
> > > > directories and files only to remove them shortly afterwards when the
> > > > drop database and drop tablespace records are replayed.
> > >
> > > 'invalid page' mechanism seems to be more proper for missing pages of a
> > > file. For missing directories, we could, of course, hack to use that (e.g.
> > > reading any page of a relfile in that database) to make sure the tablespace
> > > create code (without symlink) safer (It assumes those directories will be
> > > deleted soon).
> > >
> > > More feedback about all of the previous discussed solutions is welcome.
> >
> > It doesn't seem to me that the invalid page mechanism is
> > applicable in straightforward way, because it doesn't consider
> > simple file copy.
> >
> > Drop failure is ignored any time. I suppose we can ignore the
> > error to continue recovering as far as recovery have not reached
> > consistency. The attached would work *at least* your case, but I
> > haven't checked this covers all places where need the same
> > treatment.
>
> The comment for the new function XLogMakePGDirectory is wrong:
>
> + * There is a possibility that WAL replay causes a creation of the same
> + * directory left by the previous crash. Issuing ERROR prevents the caller
> + * from continuing recovery.
>
> The correct one is:
>
> + * There is a possibility that WAL replay causes an error by creation of a
> + * directory under a directory removed before the previous crash. Issuing
> + * ERROR prevents the caller from continuing recovery.
>
> It is fixed in the attached.
>
> regards.
>
> --
> Kyotaro Horiguchi
> NTT Open Source Software Center
[Patch] Check file type before calling AllocateFile() for files out of pg data directory to avoid potential issues (e.g. hang).
Hello, Postgres hackers.

I happened to see a hang issue when running a simple copy query. The root cause and the repro are quite simple:

mkfifo /tmp/a
run SQL: copy (select generate_series(1, 10)) to '/tmp/a';

It hangs at AllocateFile()->fopen() because that file was previously created as a FIFO, and the open is not Ctrl+C cancellable (on Linux):

#0 0x7f52df1c45a0 in __open_nocancel () at ../sysdeps/unix/syscall-template.S:81
#1 0x7f52df151f20 in _IO_file_open (is32not64=4, read_write=4, prot=438, posix_mode=, filename=0x7ffe64199a10 "\360\303[\001", fp=0x1649c40) at fileops.c:229
#2 _IO_new_file_fopen (fp=fp@entry=0x1649c40, filename=filename@entry=0x15bc3f0 "/tmp/a", mode=, mode@entry=0xaa0bb7 "w", is32not64=is32not64@entry=1) at fileops.c:339
#3 0x7f52df1465e4 in __fopen_internal (filename=0x15bc3f0 "/tmp/a", mode=0xaa0bb7 "w", is32=1) at iofopen.c:90
#4 0x007a0e90 in AllocateFile (name=0x15bc3f0 "/tmp/a", mode=mode@entry=0xaa0bb7 "w") at fd.c:2229
#5 0x005c51b4 in BeginCopyTo (pstate=pstate@entry=0x15b95f0, rel=rel@entry=0x0, query=, queryRelId=queryRelId@entry=0, filename=, is_program=, attnamelist=0x0, options=0x0) at copy.c:1919
#6 0x005c8999 in DoCopy (pstate=pstate@entry=0x15b95f0, stmt=stmt@entry=0x1596b60, stmt_location=0, stmt_len=48, processed=processed@entry=0x7ffe64199cd8) at copy.c:1078
#7 0x007d717a in standard_ProcessUtility (pstmt=0x1596ea0, queryString=0x1595dc0 "copy (select generate_series(1, 10)) to '/tmp/a';", context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0x1596f80, completionTag=0x7ffe64199f90 "") at utility.c:551

This is, in theory, not a 100% bug, but it is probably not unusual for files to conflict between PostgreSQL SQL commands and other applications on the same node, so I think a fix is needed. I checked all the code that calls AllocateFile() and wrote a simple patch to do a sanity check (if the file exists, it must be a regular file) for those files that are probably outside the postgres data directories; files inside the data directories we can probably ignore. This is actually not a perfect fix since it is not atomic (check, then open), but it should cover most scenarios. To be perfect, we might want to refactor AllocateFile() to allow an atomic check-and-create using either 'x' in fopen() or O_EXCL in open(); also, it seems we might not want to create temp files for AllocateFile() with fixed filenames. That is beyond this patch, of course.

Thanks.

0001-Make-sure-the-file-is-a-regular-file-if-it-exists-be.patch Description: Binary data
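A minimal sketch of the sanity-check idea (the error wording is illustrative; the attached patch may differ):

/* Sketch: refuse to open an existing non-regular file. Note this is
 * check-then-open, so it is not atomic, as discussed above. */
struct stat st;

if (stat(filename, &st) == 0 && !S_ISREG(st.st_mode))
	ereport(ERROR,
			(errcode(ERRCODE_WRONG_OBJECT_TYPE),
			 errmsg("\"%s\" exists but is not a regular file", filename)));

fp = AllocateFile(filename, PG_BINARY_W);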
Re: [Patch] Check file type before calling AllocateFile() for files out of pg data directory to avoid potential issues (e.g. hang).
On Wed, Apr 24, 2019 at 12:49 PM Andres Freund wrote:
> Hi,
>
> On 2019-04-24 12:46:15 +0800, Paul Guo wrote:
> > This is, in theory, not a 100% bug, but it is probably not unusual to see
> > conflicts of files between postgresql sqls and other applications on the
> > same node so I think the fix is needed. I checked all code that calls
> > AllocateFile() and wrote a simple patch to do sanity check (if the file
> > exists it must be a regular file) for those files which are probably out of
> > the postgres data directories which we probably want to ignore. This is
> > actually not a perfect fix since it is not atomic (check and open), but it
> > should fix most of the scenarios. To be perfect, we might want to refactor
> > AllocateFile() to allow atomic check&create using either 'x' in fopen()
> > or O_EXCL in open(), also it seems that we might not want to create temp
> > file for AllocateFile() with fixed filenames. This is beyond of this patch
> > of course.
>
> This seems like a bad idea to me. IMO we want to support using a pipe
> etc here. If the admin creates a fifo like this without attaching a
> program it seems like it's their fault.

Oh, I did not know about that usage scenario before. So yes, for this we need to keep the current code logic in the copy code.

> Greetings,
>
> Andres Freund
Re: [Patch] Check file type before calling AllocateFile() for files out of pg data directory to avoid potential issues (e.g. hang).
On Wed, Apr 24, 2019 at 10:36 PM Tom Lane wrote:
> Alvaro Herrera writes:
> > On 2019-Apr-24, Paul Guo wrote:
> >> On Wed, Apr 24, 2019 at 12:49 PM Andres Freund wrote:
> >>> This seems like a bad idea to me. IMO we want to support using a pipe
> >>> etc here. If the admin creates a fifo like this without attaching a
> >>> program it seems like it's their fault.
>
> >> Oh, I never know this application scenario before. So yes, for this, we
> >> need to keep the current code logic in copy code.
>
> > But the pgstat.c patch seems reasonable to me.
>
> Nah, I don't buy that one either. Nobody has any business creating any
> non-Postgres files in the stats directory ... and if somebody does want
> to stick a FIFO in there, perhaps for debug purposes, why should we stop
> them?

For the pgstat case, the files passed to AllocateFile() are actually temp files which are soon renamed to other file names. Users are unlikely to want those as FIFOs; for developers, 'tail -f' should be sufficient for debugging purposes.

	const char *tmpfile = permanent ? PGSTAT_STAT_PERMANENT_TMPFILE : pgstat_stat_tmpname;

	fpout = AllocateFile(tmpfile, PG_BINARY_W);
	fwrite(fpout, ...);
	rename(tmpfile, statfile);

I'm not sure whether those hardcoded temp filenames (not just the ones in pgstat) are reused across postgres reboots. If not, we could instead call a libc function to create unique temp files and remove those hardcoded temp filename variables; otherwise we would also want them to be regular files.

> The case with COPY is a bit different, since there it's reasonable to be
> worried about collisions with other users' files --- but I agree with
> Andres that this change would eliminate too many valid use-cases.
>
> regards, tom lane
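A sketch of the unique-temp-file idea, using mkstemp() so the name can never collide with a pre-existing FIFO; this is purely illustrative and not how pgstat works today:

/* Sketch only: let libc pick a unique name under the stats directory. */
char		tmppath[MAXPGPATH];
int			fd;
FILE	   *fpout;

snprintf(tmppath, sizeof(tmppath), "%s/pgstat.tmp.XXXXXX",
		 PGSTAT_STAT_PERMANENT_DIRECTORY);
fd = mkstemp(tmppath);			/* creates with O_EXCL semantics */
if (fd < 0 || (fpout = fdopen(fd, "w")) == NULL)
	ereport(LOG,
			(errcode_for_file_access(),
			 errmsg("could not create temporary statistics file \"%s\": %m",
					tmppath)));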
Re: standby recovery fails (tablespace related) (tentative patch and discussion)
On Wed, Apr 24, 2019 at 4:14 PM Kyotaro HORIGUCHI <horiguchi.kyot...@lab.ntt.co.jp> wrote:
> Mmm. I posted to wrong thread. Sorry.
>
> At Tue, 23 Apr 2019 16:39:49 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI wrote in <20190423.163949.36763221.horiguchi.kyot...@lab.ntt.co.jp>
> > At Tue, 23 Apr 2019 13:31:58 +0800, Paul Guo wrote in
> > > Hi Kyotaro, ignoring the MakePGDirectory() failure will fix this database
> > > create redo error, but I suspect some other kind of redo, which depends on
> > > the files under the directory (they are not copied since the directory is
> > > not created) and also cannot be covered by the invalid page mechanism,
> > > could fail. Thanks.
> >
> > If recovery starts from just after tablespace creation, that's
> > simple. The Symlink to the removed tablespace is already removed
> > in the case. Hence server innocently create files directly under
> > pg_tblspc, not in the tablespace. Finally all files that were
> > supposed to be created in the removed tablespace are removed
> > later in recovery.
> >
> > If recovery starts from recalling page in a file that have been
> > in the tablespace, XLogReadBufferExtended creates one (perhaps
> > directly in pg_tblspc as described above) and the files are
> > removed later in recoery the same way to above. This case doen't
> > cause FATAL/PANIC during recovery even in master.
> >
> > XLogReadBufferExtended@xlogutils.c
> > | * Create the target file if it doesn't already exist. This lets us cope
> > | * if the replay sequence contains writes to a relation that is later
> > | * deleted. (The original coding of this routine would instead suppress
> > | * the writes, but that seems like it risks losing valuable data if the
> > | * filesystem loses an inode during a crash. Better to write the data
> > | * until we are actually told to delete the file.)
> >
> > So buffered access cannot be a problem for the reason above. The
> > remaining possible issue is non-buffered access to files in
> > removed tablespaces. This is what I mentioned upthread:
> >
> > me> but I haven't checked this covers all places where need the same
> > me> treatment.
>
> RM_DBASE_ID is fixed by the patch.
>
> XLOG/XACT/CLOG/MULTIXACT/RELMAP/STANDBY/COMMIT_TS/REPLORIGIN/LOGICALMSG:
> - are not relevant.
>
> HEAP/HEAP2/BTREE/HASH/GIN/GIST/SEQ/SPGIST/BRIN/GENERIC:
> - Resources works on buffer is not affected.
>
> SMGR:
> - Both CREATE and TRUNCATE seems fine.
>
> TBLSPC:
> - We don't nest tablespace directories. No Problem.
>
> I don't find a similar case.

I took some time digging into the related code. It seems that ignoring the failure when the dst directory cannot be created directly should be fine, since the smgr redo code eventually creates the tablespace path by calling TablespaceCreateDbspace(). What's more, I found some additional issues.

1) The error message below is actually misleading.

2019-04-17 14:52:14.951 CST [23030] FATAL: could not create directory "pg_tblspc/65546/PG_12_201904072/65547": No such file or directory
2019-04-17 14:52:14.951 CST [23030] CONTEXT: WAL redo at 0/3011650 for Database/CREATE: copy dir 1663/1 to 65546/65547

That is due to dbase_desc(). It can be fixed simply, following the code logic in GetDatabasePath().

2) It seems that the src directory could be missing, and then dbase_redo()->copydir() could error out. For example:

\!rm -rf /tmp/tbspace1
\!mkdir /tmp/tbspace1
\!rm -rf /tmp/tbspace2
\!mkdir /tmp/tbspace2
create tablespace tbs1 location '/tmp/tbspace1';
create tablespace tbs2 location '/tmp/tbspace2';
create database db1 tablespace tbs1;
alter database db1 set tablespace tbs2;
drop tablespace tbs1;

Let's say the standby finishes all replay, but the redo LSN in pg_control still points at the 'alter database' record; then kill postgres. In theory, at startup dbase_redo()->copydir() will ERROR, since 'drop tablespace tbs1' has already removed the directories (and symlink) of tbs1. The simple code change below could fix that:

diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 9707afabd9..7d755c759e 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -2114,6 +2114,15 @@ dbase_redo(XLogReaderState *record)
 	 */
 	FlushDatabaseBuffers(xlrec->src_db_id);

+	/*
+	 * It is possible that the source directory is missing if
+	 * we are re-replaying the xlog while subsequent xlogs
+	 * drop the tablespace in previous replaying. For this
+	 * we just skip.
+	 */
+	if (!(stat(src_path, &st) == 0 && S_ISDIR(st.st_mode)))
Re: standby recovery fails (tablespace related) (tentative patch and discussion)
I updated the original patch to:

1) Skip copydir() if either the src path or the dst parent path is missing in dbase_redo(). Both missing cases seem possible. For the missing-src-path case, mkdir_p() is meaningless. Moving the directory-existence check into dbase_redo() seems to have less impact on other code.

2) Fixed dbase_desc(). Now the xlog output looks correct:

rmgr: Database len (rec/tot): 42/42, tx: 486, lsn: 0/016386A8, prev 0/01638630, desc: CREATE copy dir base/1 to pg_tblspc/16384/PG_12_201904281/16386
rmgr: Database len (rec/tot): 34/34, tx: 487, lsn: 0/01638EB8, prev 0/01638E40, desc: DROP dir pg_tblspc/16384/PG_12_201904281/16386

I was not familiar with the TAP test details before; I learned a lot about how to test such cases from Kyotaro's patch series. 👍

On Sun, Apr 28, 2019 at 3:33 PM Paul Guo wrote:
> On Wed, Apr 24, 2019 at 4:14 PM Kyotaro HORIGUCHI <horiguchi.kyot...@lab.ntt.co.jp> wrote:
>> Mmm. I posted to wrong thread. Sorry.
>>
>> At Tue, 23 Apr 2019 16:39:49 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI wrote in <20190423.163949.36763221.horiguchi.kyot...@lab.ntt.co.jp>
>> > At Tue, 23 Apr 2019 13:31:58 +0800, Paul Guo wrote in
>> > > Hi Kyotaro, ignoring the MakePGDirectory() failure will fix this database
>> > > create redo error, but I suspect some other kind of redo, which depends on
>> > > the files under the directory (they are not copied since the directory is
>> > > not created) and also cannot be covered by the invalid page mechanism,
>> > > could fail. Thanks.
>> >
>> > If recovery starts from just after tablespace creation, that's
>> > simple. The Symlink to the removed tablespace is already removed
>> > in the case. Hence server innocently create files directly under
>> > pg_tblspc, not in the tablespace. Finally all files that were
>> > supposed to be created in the removed tablespace are removed
>> > later in recovery.
>> >
>> > If recovery starts from recalling page in a file that have been
>> > in the tablespace, XLogReadBufferExtended creates one (perhaps
>> > directly in pg_tblspc as described above) and the files are
>> > removed later in recoery the same way to above. This case doen't
>> > cause FATAL/PANIC during recovery even in master.
>> >
>> > XLogReadBufferExtended@xlogutils.c
>> > | * Create the target file if it doesn't already exist. This lets us cope
>> > | * if the replay sequence contains writes to a relation that is later
>> > | * deleted. (The original coding of this routine would instead suppress
>> > | * the writes, but that seems like it risks losing valuable data if the
>> > | * filesystem loses an inode during a crash. Better to write the data
>> > | * until we are actually told to delete the file.)
>> >
>> > So buffered access cannot be a problem for the reason above. The
>> > remaining possible issue is non-buffered access to files in
>> > removed tablespaces. This is what I mentioned upthread:
>> >
>> > me> but I haven't checked this covers all places where need the same
>> > me> treatment.
>>
>> RM_DBASE_ID is fixed by the patch.
>>
>> XLOG/XACT/CLOG/MULTIXACT/RELMAP/STANDBY/COMMIT_TS/REPLORIGIN/LOGICALMSG:
>> - are not relevant.
>>
>> HEAP/HEAP2/BTREE/HASH/GIN/GIST/SEQ/SPGIST/BRIN/GENERIC:
>> - Resources works on buffer is not affected.
>>
>> SMGR:
>> - Both CREATE and TRUNCATE seems fine.
>>
>> TBLSPC:
>> - We don't nest tablespace directories. No Problem.
>>
>> I don't find a similar case.
>
> I took some time digging into the related code. It seems that ignoring
> the failure when the dst directory cannot be created directly should be fine,
> since the smgr redo code eventually creates the tablespace path by calling
> TablespaceCreateDbspace(). What's more, I found some additional issues.
>
> 1) The error message below is actually misleading.
>
> 2019-04-17 14:52:14.951 CST [23030] FATAL: could not create directory "pg_tblspc/65546/PG_12_201904072/65547": No such file or directory
> 2019-04-17 14:52:14.951 CST [23030] CONTEXT: WAL redo at 0/3011650 for Database/CREATE: copy dir 1663/1 to 65546/65547
>
> That is due to dbase_desc(). It can be fixed simply, following the code
> logic in GetDatabasePath().
>
> 2) It seems that the src directory could be missing, and then
> dbase_redo()->copydir() could error out. For example,
>
> \!rm -r
Re: standby recovery fails (tablespace related) (tentative patch and discussion)
Thanks for the reply.

On Tue, May 7, 2019 at 2:47 PM Kyotaro HORIGUCHI <horiguchi.kyot...@lab.ntt.co.jp> wrote:
> + if (!(stat(parent_path, &st) == 0 && S_ISDIR(st.st_mode)))
> + {
>
> This patch is allowing missing source and destination directory
> even in consistent state. I don't think it is safe.

I do not understand this. Can you elaborate?

> +ereport(WARNING,
> +        (errmsg("directory \"%s\" for copydir() does not exists."
> +                "It is possibly expected. Skip copydir().",
> +                parent_path)));
>
> This message seems unfriendly to users, or it seems like an elog
> message. How about something like this. The same can be said for
> the source directory.
>
> | WARNING: skipped creating database directory: "%s"
> | DETAIL: The tabelspace %u may have been removed just before crash.

Yeah. That looks better.

> # I'm not confident in this at all:(
>
> > 2) Fixed dbase_desc(). Now the xlog output looks correct.
> >
> > rmgr: Database len (rec/tot): 42/42, tx: 486, lsn: 0/016386A8, prev 0/01638630, desc: CREATE copy dir base/1 to pg_tblspc/16384/PG_12_201904281/16386
> >
> > rmgr: Database len (rec/tot): 34/34, tx: 487, lsn: 0/01638EB8, prev 0/01638E40, desc: DROP dir pg_tblspc/16384/PG_12_201904281/16386
>
> WAL records don't convey such information. The previous
> description seems right to me.

2019-04-17 14:52:14.951 CST [23030] CONTEXT: WAL redo at 0/3011650 for Database/CREATE: copy dir 1663/1 to 65546/65547

The directories in this message are definitely wrong and misleading.

> > > > Also I'd suggest we should use pg_mkdir_p() in TablespaceCreateDbspace()
> > > > to replace the code block includes a lot of
> > > > get_parent_directory(), MakePGDirectory(), etc even it
> > > > is not fixing a bug since pg_mkdir_p() code change seems to be more
> > > > graceful and simpler.
>
> But I don't agree to this. pg_mkdir_p goes above two-parents up,
> which would be unwanted here.

I do not understand this either. pg_mkdir_p() is similar to 'mkdir -p'. The change just makes the code more concise, though in theory it is not needed.

> > > > Whatever ignore mkdir failure or mkdir_p, I found that these steps seem to
> > > > be error-prone along with postgres evolving since they are hard to test and
> > > > also we are not easy to think out various potential bad cases. Is it possible
> > > > that we should do real checkpoint (flush & update redo lsn) when seeing
> > > > checkpoint xlogs for these operations? This will slow down standby
> > > > but master also does this and also these operations are not usual,
> > > > espeically it seems that it does not slow down wal receiving usually?
>
> That dramatically slows recovery (not replication) if databases
> are created and deleted frequently. That wouldn't be acceptable.

This behavior is rare and seems to have the same impact on master & standby via checkpoint/restartpoint. We do not worry about the master, so we should not worry about the standby either.
Re: standby recovery fails (tablespace related) (tentative patch and discussion)
On Tue, May 14, 2019 at 11:06 AM Kyotaro HORIGUCHI <horiguchi.kyot...@lab.ntt.co.jp> wrote:
> Hello.
>
> At Mon, 13 May 2019 17:37:50 +0800, Paul Guo wrote in <caeet0zf9yn4daxyuflzocayyxuff1ms_oqwea+rwv3gha5q...@mail.gmail.com>
> > Thanks for the reply.
> >
> > On Tue, May 7, 2019 at 2:47 PM Kyotaro HORIGUCHI wrote:
> > >
> > > + if (!(stat(parent_path, &st) == 0 && S_ISDIR(st.st_mode)))
> > > + {
> > >
> > > This patch is allowing missing source and destination directory
> > > even in consistent state. I don't think it is safe.
> >
> > I do not understand this. Can you elaborate?
>
> Suppose we were recoverying based on a backup at LSN1 targeting
> to LSN3 then it crashed at LSN2, where LSN1 < LSN2 <= LSN3. LSN2
> is called as "consistency point", before where the database is
> not consistent. It's because we are applying WAL recored older
> than those that were already applied in the second trial. The
> same can be said for crash recovery, where LSN1 is the latest
> checkpoint ('s redo LSN) and LSN2=LSN3 is the crashed LSN.
>
> Creation of an existing directory or dropping of a non-existent
> directory are apparently inconsistent or "broken" so we should
> stop recovery when seeing such WAL records while database is in
> consistent state.

This seems hard to detect. I thought about using the invalid_page mechanism a while ago, but it seems hard to fully detect a dropped tablespace with it.

> > > > 2) Fixed dbase_desc(). Now the xlog output looks correct.
> > > >
> > > > rmgr: Database len (rec/tot): 42/42, tx: 486, lsn: 0/016386A8, prev 0/01638630, desc: CREATE copy dir base/1 to pg_tblspc/16384/PG_12_201904281/16386
> > > >
> > > > rmgr: Database len (rec/tot): 34/34, tx: 487, lsn: 0/01638EB8, prev 0/01638E40, desc: DROP dir pg_tblspc/16384/PG_12_201904281/16386
> > >
> > > WAL records don't convey such information. The previous
> > > description seems right to me.
> >
> > 2019-04-17 14:52:14.951 CST [23030] CONTEXT: WAL redo at 0/3011650 for Database/CREATE: copy dir 1663/1 to 65546/65547
> >
> > The directories are definitely wrong and misleading.
>
> The original description is right in the light of how the server
> recognizes. The record exactly says that "copy dir 1663/1 to
> 65546/65547" and the latter path is converted in filesystem layer
> via a symlink.

In either $PG_DATA/pg_tblspc or the symlinked real tablespace directory, there is an additional directory like PG_12_201905221 between the tablespace oid and the database oid. See the directory layout below; the directory info in the xlog dump output was therefore not correct.

$ ls -lh data/pg_tblspc/
total 0
lrwxrwxrwx. 1 gpadmin gpadmin 6 May 27 17:23 16384 -> /tmp/2

$ ls -lh /tmp/2
total 0
drwx------. 3 gpadmin gpadmin 18 May 27 17:24 PG_12_201905221

> > > > > Also I'd suggest we should use pg_mkdir_p() in TablespaceCreateDbspace()
> > > > > to replace the code block includes a lot of
> > > > > get_parent_directory(), MakePGDirectory(), etc even it
> > > > > is not fixing a bug since pg_mkdir_p() code change seems to be more
> > > > > graceful and simpler.
> > >
> > > But I don't agree to this. pg_mkdir_p goes above two-parents up,
> > > which would be unwanted here.
> >
> > I do not understand this also. pg_mkdir_p() is similar to 'mkdir -p'.
> > This change just makes the code concise. Though in theory the change is
> > not needed.
>
> We don't want to create tablespace directory after concurrent
> DROPing, as the comment just above is saying:
>
> | * Acquire TablespaceCreateLock to ensure that no DROP TABLESPACE
> | * or TablespaceCreateDbspace is running concurrently.
>
> If the concurrent DROP TABLESPACE destroyed the grand parent
> directory, we mustn't create it again.

Yes, this is a good reason to keep the original code. Thanks.

By the way, based on your previous test patch, I added another test which can easily detect the missing-src-directory case.
Re: undefined symbol: PQgssEncInUse
Have you used the correct libpq library? If yes, you might want to check the build logs and related files to see what is wrong. In my environment it is OK with gssapi either enabled or disabled.

On Thu, May 30, 2019 at 9:22 AM Donald Dong wrote:
> Hi,
>
> After I make temp-install on HEAD with a clean build, I fail to start
> psql (tmp_install/usr/local/pgsql/bin/psql) and get an error message:
>
> ./psql: symbol lookup error: ./psql: undefined symbol: PQgssEncInUse
>
> However, make check and other tests still work. For me, it is fine
> until commit b0b39f72b9904bcb80f97b35837ccff1578aa4b8. I wonder if
> this only occurs to me?
>
> Thank you,
> Donald Dong
A WalSnd issue related to state WALSNDSTATE_STOPPING
Hello,

Recently I encountered a panic (assert failure) on a cassert-enabled build of Greenplum database (an MPP database based on Postgres). The underlying PostgreSQL version is 9.4 stable (9.4.19). The panic is not complex; please see below for details. This issue seems to be a bug in PostgreSQL as well, so I'm writing to the community.

(gdb) bt
#0 0x7fb8c56321f7 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1 0x7fb8c56338e8 in __GI_abort () at abort.c:90
#2 0x00ae3fd2 in ExceptionalCondition (conditionName=0xdcec75 "!(syncWalSnd)", errorType=0xdce77a "FailedAssertion", fileName=0xdce770 "syncrep.c", lineNumber=505) at assert.c:66
#3 0x00924dbc in SyncRepReleaseWaiters () at syncrep.c:505
#4 0x00915e5e in ProcessStandbyReplyMessage () at walsender.c:1667
#5 0x00915bea in ProcessStandbyMessage () at walsender.c:1566
#6 0x00915ab5 in ProcessRepliesIfAny () at walsender.c:1502
#7 0x00916303 in WalSndLoop (send_data=0x916d46 ) at walsender.c:1923
#8 0x0091439b in StartReplication (cmd=0x15abea8) at walsender.c:704
#9 0x0091586e in exec_replication_command (cmd_string=0x161fa78 "START_REPLICATION 1/8400 TIMELINE 6") at walsender.c:1416
#10 0x00979915 in PostgresMain (argc=1, argv=0x15bafd8, dbname=0x15baec8 "", username=0x15baea8 "gpadmin") at postgres.c:5296

(gdb) p max_wal_senders
$1 = 1
(gdb) p WalSndCtl->walsnds[0]
$2 = {
  pid = 2765,
  state = WALSNDSTATE_STOPPING,
  sentPtr = 6509560784,
  sendKeepalive = 0 '\000',
  needreload = 0 '\000',
  write = 6509560784,
  flush = 6509560664,
  apply = 6509560664,
  caughtup_within_range = 1 '\001',
  xlogCleanUpTo = 6509560664,
  mutex = 0 '\000',
  latch = {
    is_set = 1,
    is_shared = 1 '\001',
    owner_pid = 2765
  },
  sync_standby_priority = 1,
  synchronous = 0 '\000',
  replica_disconnected_at = 0
}

Below is the related code:

	for (i = 0; i < max_wal_senders; i++)
	{
		/* use volatile pointer to prevent code rearrangement */
		volatile WalSnd *walsnd = &walsndctl->walsnds[i];

		if (walsnd->pid != 0 &&
			walsnd->state == WALSNDSTATE_STREAMING &&
			walsnd->sync_standby_priority > 0 &&
			(priority == 0 ||
			 priority > walsnd->sync_standby_priority) &&
			!XLogRecPtrIsInvalid(walsnd->flush))
		{
			priority = walsnd->sync_standby_priority;
			syncWalSnd = walsnd;
		}
	}

	/*
	 * We should have found ourselves at least.
	 */
	Assert(syncWalSnd);

The reason for the panic: there is just one WAL sender, and WalSndCtl->walsnds[0].state is WALSNDSTATE_STOPPING, so syncWalSnd ends up NULL, which trips the assertion.

The latest PostgreSQL code removes the Assert, but similar logic was kept. It looks like we need to modify the logic as below (PG 9.4 STABLE) to allow WALSNDSTATE_STOPPING, which seems reasonable here if I understand correctly:

	if (walsnd->pid != 0 &&
-		walsnd->state == WALSNDSTATE_STREAMING &&
+		walsnd->state >= WALSNDSTATE_STREAMING &&
		walsnd->sync_standby_priority > 0 &&
		(priority == 0 ||
		 priority > walsnd->sync_standby_priority) &&

For the latest master, the logic is as below, but it could be modified similarly (see the sketch after this message):

	/* Must be streaming */
	if (state != WALSNDSTATE_STREAMING)
		continue;

If yes, this change should go to all branches that have the patch which introduced WALSNDSTATE_STOPPING.

Thanks.
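For illustration, a minimal sketch of the corresponding master-branch change, assuming the stopping state should be treated like streaming here:

	/* Sketch: also accept WAL senders that have advanced to the stopping
	 * state, since they may still need to release syncrep waiters. */
	if (state != WALSNDSTATE_STREAMING &&
		state != WALSNDSTATE_STOPPING)
		continue;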
Re: A WalSnd issue related to state WALSNDSTATE_STOPPING
On Thu, Nov 22, 2018 at 1:29 PM Michael Paquier wrote: > On Wed, Nov 21, 2018 at 04:09:41PM +0900, Michael Paquier wrote: > > The checkpointer initializes a shutdown checkpoint where it tells to all > > the WAL senders to stop once all the children processes are gone, so it > > seems to me that there is little point in processing > > SyncRepReleaseWaiters() when a WAL sender is in WALSNDSTATE_STOPPING > > state as at this stage syncrep makes little sense. It is still > > necessary to process standby messages at this stage so as the primary > > can shut down when it is sure that all the standbys have flushed the > > shutdown checkpoint record of the primary. > > Just refreshed my memory with c6c33343, which is actually at the origin > of the issue, and my previous argument is flawed. If a WAL sender has > reached WALSNDSTATE_STOPPING no regular backends are around but a WAL > sender could always commit a transaction in parallel which may need to > make sure that its record is flushed and sync'ed, and this needs to make > sure that waiters are correctly released. So it is necessary to patch > up SyncRepGetSyncStandbysPriority and SyncRepGetSyncStandbysQuorum as > mentioned upthread, perhaps adding a comment when looking at > MyWalSnd->state looks adapted. Paul, would you like to write a patch? > -- > Michael > Yes, please see the attached patch. 0001-Allow-stopping-wal-senders-to-be-invovled-in-SyncRep.patch Description: Binary data
Re: A WalSnd issue related to state WALSNDSTATE_STOPPING
> > On Thu, Nov 22, 2018 at 06:31:04PM +0800, Paul Guo wrote: > > Yes, please see the attached patch. > > Thanks, I have reviewed your patch, and could not resist to change > SyncRepReleaseWaiters() on top of the rest with a comment to not be > confused about the WAL sender states allowed to release waiters. > > The previous experience is proving that it seems unwise to rely on the > order of the elements in WalSndState, so I have tweaked things so as the > state is listed for all the code paths involved. > > Paul, what do you think? > -- > Michael > Agree. Your code change is safer.
Two small issues related to table_relation_copy_for_cluster() and CTAS with no data.
Hello hackers,

While reading the latest master branch code, I found a couple of things that we may be able to improve.

1. The AM interface table_relation_copy_for_cluster().

    static inline void
    table_relation_copy_for_cluster(Relation OldTable, Relation NewTable,
                                    Relation OldIndex,
                                    bool use_sort,
                                    TransactionId OldestXmin,
                                    TransactionId *xid_cutoff,
                                    MultiXactId *multi_cutoff,
                                    double *num_tuples,
                                    double *tups_vacuumed,
                                    double *tups_recently_dead)

- A line for the parameter num_tuples should be added below "Output parameters" in the comment.
- Looking at the caller code, i.e. copy_table_data(), it initializes *num_tuples, *tups_vacuumed and *tups_recently_dead first. This does not seem to be good API design or implementation. We'd better let the AM API return the values without requiring initialization by the callers, right? (A sketch of this idea follows at the end of this message.)

2. CTAS (CREATE TABLE AS) with no data.

It seems that we never run into intorel_receive() in that case, while intorel_startup() can still be reached for "create table t1 as execute ... with no data". So it looks like we do not need to check into->skipData in intorel_receive(). If we really want to check into->skipData, we could add an assert check there, or, if I missed some code paths through which we could still reach that branch, we could instead call the code below in intorel_receive() to stop early, right?

    if (myState->into->skipData)
        return false;

Regards,
Paul
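For illustration, here is a minimal sketch of the first suggestion, assuming the change lands in the heap AM implementation (heapam_relation_copy_for_cluster() in heapam_handler.c); a sketch only, not the actual patch:

    static void
    heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
                                     Relation OldIndex, bool use_sort,
                                     TransactionId OldestXmin,
                                     TransactionId *xid_cutoff,
                                     MultiXactId *multi_cutoff,
                                     double *num_tuples,
                                     double *tups_vacuumed,
                                     double *tups_recently_dead)
    {
        /*
         * Own the output parameters here so that callers such as
         * copy_table_data() no longer need to zero them before calling in.
         */
        *num_tuples = 0;
        *tups_vacuumed = 0;
        *tups_recently_dead = 0;

        /* ... existing rewrite/sort logic unchanged ... */
    }

The corresponding initializations in copy_table_data() could then be removed, keeping the contract in one place.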
pg_rewind fails if there is a read only file.
Several weeks ago I saw this issue in a production environment. The read-only file looked like a credential file. Michael told me that usually such kinds of files are better kept in non-pgdata directories in production environments. Thinking about it further, it seems that pg_rewind should be more user friendly and tolerate such scenarios.

The failure error is "Permission denied" after open(). The reason is that open() fails with the below mode in open_target_file():

    mode = O_WRONLY | O_CREAT | PG_BINARY;
    if (trunc)
        mode |= O_TRUNC;
    dstfd = open(dstpath, mode, pg_file_create_mode);

The fix should be quite simple: if open() fails with "Permission denied", we call chmod() to add S_IWUSR just before open(), and call chmod() again to restore the flags right after open(). A stat() call to get the previous st_mode flags is needed.

Any other suggestions or thoughts?

Thanks,

-- Paul Guo (Vmware)
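For illustration, a minimal sketch of that idea (a sketch only, assuming we are inside open_target_file() with dstpath and mode already computed; error handling elided):

    dstfd = open(dstpath, mode, pg_file_create_mode);
    if (dstfd < 0 && errno == EACCES)
    {
        struct stat st;

        /* Remember the original mode and temporarily add owner write. */
        if (stat(dstpath, &st) == 0)
        {
            chmod(dstpath, st.st_mode | S_IWUSR);
            dstfd = open(dstpath, mode, pg_file_create_mode);
            /* Restore the original flags once the fd is open. */
            chmod(dstpath, st.st_mode);
        }
    }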
Re: pg_rewind fails if there is a read only file.
> Presumably the user has a reason for adding the file read-only to the > data directory, and we shouldn't lightly ignore that. > > Michael's advice is reasonable. This seems like a case of: I agree. Attached is a short patch to handle the case. The patch was tested in my dev environment. v1-0001-Fix-pg_rewind-failure-due-to-read-only-file-open-.patch Description: Binary data
sync request forward function ForwardSyncRequest() might hang for some time in a corner case?
Hi hackers,

I found this while reading the related code. Here is the scenario:

    bool
    RegisterSyncRequest(const FileTag *ftag, SyncRequestType type,
                        bool retryOnError)

For the case where retryOnError is true, the function calls ForwardSyncRequest() in a loop until it succeeds. But in ForwardSyncRequest(), if we run into the below branch, RegisterSyncRequest() will need to loop until the checkpointer absorbs the existing requests, so ForwardSyncRequest() might hang for some time until a checkpoint request is triggered. This scenario seems to be possible in theory, though the chance is not high.

ForwardSyncRequest():

    if (CheckpointerShmem->checkpointer_pid == 0 ||
        (CheckpointerShmem->num_requests >= CheckpointerShmem->max_requests &&
         !CompactCheckpointerRequestQueue()))
    {
        /*
         * Count the subset of writes where backends have to do their own
         * fsync
         */
        if (!AmBackgroundWriterProcess())
            CheckpointerShmem->num_backend_fsync++;
        LWLockRelease(CheckpointerCommLock);
        return false;
    }

One fix is to add code similar to the below in RegisterSyncRequest(), to trigger a checkpoint for this scenario.

    // checkpointer_triggered: variable for one trigger only.
    if (!ret && retryOnError && ProcGlobal->checkpointerLatch &&
        !checkpointer_triggered)
        SetLatch(ProcGlobal->checkpointerLatch);

Any comments?

Regards,
Paul Guo (Vmware)
Re: pg_rewind fails if there is a read only file.
> You seem to have missed my point completely. The answer to this problem
> IMNSHO is "Don't put a read-only file in the data directory".

Oh, sorry. Well, if we really do not want this, we may want to document it and keep educating users. Meanwhile, though, the product should probably be more user friendly for this case, especially considering that the fix would be trivial, and I suspect it is inevitable that some extensions put read-only files (e.g. credential files) in pgdata.
Re: pg_rewind fails if there is a read only file.
On Wed, May 26, 2021 at 10:32 PM Andrew Dunstan wrote:
>
> On 5/26/21 2:43 AM, Laurenz Albe wrote:
> > On Wed, 2021-05-26 at 08:57 +0900, Michael Paquier wrote:
> >> On Tue, May 25, 2021 at 03:17:37PM -0400, Andrew Dunstan wrote:
> >>> If we do decide to do something the question arises what should it do?
> >>> If we're to allow for it I'm wondering if the best thing would be simply
> >>> to ignore such a file.
> >> Enforcing assumptions that any file could be read-only is a very bad
> >> idea, as that could lead to weird behaviors if a FS is turned as
> >> becoming read-only suddenly while doing a rewind. Another idea that
> >> has popped out across the years was to add an option to pg_rewind so
> >> as users could filter files manually. That could be easily dangerous
> >> though in the wrong hands, as one could think that it is a good idea
> >> to skip a control file, for example.
> >>
> >> The thing is that here we actually know the set of files we'd like to
> >> ignore most of the time, and we still want to have some automated
> >> control what gets filtered. So here is a new idea: we build a list of
> >> files based on a set of GUC parameters using postgres -C on the target
> >> data folder, and assume that these are safe enough to be skipped all
> >> the time, if these are in the data folder.
> > That sounds complicated, but should work.
> > There should be a code comment somewhere that warns people not to forget
> > to look at that when they add a new GUC.
> >
> > I can think of two alternatives to handle this:
> >
> > - Skip files that cannot be opened for writing and issue a warning.
> >   That is simple, but coarse.
> >   A slightly more sophisticated version would first check if files
> >   are the same on both machines and skip the warning for those.
> >
> > - Paul's idea to try and change the mode on the read-only file
> >   and reset it to the original state after pg_rewind is done.
>
> How about we only skip (with a warning) read-only files if they are in
> the data root, not a subdirectory? That would save us from silently

That assumption seems to limit the user scenarios.

> ignoring read-only filesystems and not involve using a GUC.

How about this: by default, change and reset the file mode before and after open() for read-only files, but also allow passing skip-file names to pg_rewind (e.g. pg_rewind --skip filenameN or --skip-list-file listfile) in case users really want to skip some files (probably not just read-only files). Presumably the people who run pg_rewind are DBAs or admins who have basic knowledge of this, so we should not worry too much about users skipping important files (we could even set up a deny list for this). Also, this solution works fine on a read-only file system, since pg_rewind quits quickly once adding write permission to any read-only file fails.

-- Paul Guo (Vmware)
Re: sync request forward function ForwardSyncRequest() might hang for some time in a corner case?
On Thu, May 27, 2021 at 7:11 PM Julien Rouhaud wrote:
>
> On Tue, May 25, 2021 at 4:39 PM Paul Guo wrote:
> >
> > Hi hackers,
> >
> > I found this while reading the related code. Here is the scenario:
> >
> > bool
> > RegisterSyncRequest(const FileTag *ftag, SyncRequestType type,
> >                     bool retryOnError)
> >
> > For the case where retryOnError is true, the function calls
> > ForwardSyncRequest() in a loop until it succeeds, but in
> > ForwardSyncRequest(), if we run into the below branch,
> > RegisterSyncRequest() will need to loop until the checkpointer absorbs
> > the existing requests, so ForwardSyncRequest() might hang for some time
> > until a checkpoint request is triggered. This scenario seems to be
> > possible in theory though the chance is not high.
>
> It seems like a really unlikely scenario, but maybe possible if you
> use a lot of unlogged tables maybe (as you could eventually
> dirty/evict more than NBuffers buffers without triggering enough WALs
> activity to trigger a checkpoint with any sane checkpoint
> configuration).

RegisterSyncRequest() handles the SYNC_UNLINK_REQUEST and SYNC_FORGET_REQUEST scenarios, besides the usual SYNC_REQUEST type for buffer sync.

> > ForwardSyncRequest():
> >
> >     if (CheckpointerShmem->checkpointer_pid == 0 ||
> >         (CheckpointerShmem->num_requests >= CheckpointerShmem->max_requests &&
> >          !CompactCheckpointerRequestQueue()))
> >     {
> >         /*
> >          * Count the subset of writes where backends have to do their own
> >          * fsync
> >          */
> >         if (!AmBackgroundWriterProcess())
> >             CheckpointerShmem->num_backend_fsync++;
> >         LWLockRelease(CheckpointerCommLock);
> >         return false;
> >     }
> >
> > One fix is to add code similar to the below in RegisterSyncRequest(),
> > to trigger a checkpoint for this scenario.
> >
> >     // checkpointer_triggered: variable for one trigger only.
> >     if (!ret && retryOnError && ProcGlobal->checkpointerLatch &&
> >         !checkpointer_triggered)
> >         SetLatch(ProcGlobal->checkpointerLatch);
> >
> > Any comments?
>
> It looks like you intended to set the checkpointer_triggered var but

Yes, this is just pseudocode.

> didn't. Also this will wake up the checkpointer but won't force a
> checkpoint (unlike RequestCheckpoint()). It may be a good thing

I do not expect an immediate checkpoint. AbsorbSyncRequests() is enough, since after that RegisterSyncRequest() can finish.

> though as it would only absorb the requests and go back to sleep if no
> other threshold is reached. Apart from the implementation details it
> seems like it could help in this unlikely event.

-- Paul Guo (Vmware)
Re: sync request forward function ForwardSyncRequest() might hang for some time in a corner case?
On Thu, May 27, 2021 at 9:59 PM Paul Guo wrote:
>
> On Thu, May 27, 2021 at 7:11 PM Julien Rouhaud wrote:
> >
> > On Tue, May 25, 2021 at 4:39 PM Paul Guo wrote:
> > >
> > > Hi hackers,
> > >
> > > I found this while reading the related code. Here is the scenario:
> > >
> > > bool
> > > RegisterSyncRequest(const FileTag *ftag, SyncRequestType type,
> > >                     bool retryOnError)
> > >
> > > For the case where retryOnError is true, the function calls
> > > ForwardSyncRequest() in a loop until it succeeds, but in
> > > ForwardSyncRequest(), if we run into the below branch,
> > > RegisterSyncRequest() will need to loop until the checkpointer absorbs
> > > the existing requests, so ForwardSyncRequest() might hang for some time
> > > until a checkpoint request is triggered. This scenario seems to be
> > > possible in theory though the chance is not high.
> >
> > It seems like a really unlikely scenario, but maybe possible if you
> > use a lot of unlogged tables maybe (as you could eventually
> > dirty/evict more than NBuffers buffers without triggering enough WALs
> > activity to trigger a checkpoint with any sane checkpoint
> > configuration).
>
> RegisterSyncRequest() handles the SYNC_UNLINK_REQUEST and
> SYNC_FORGET_REQUEST scenarios, besides the usual SYNC_REQUEST type for
> buffer sync.
>
> > > ForwardSyncRequest():
> > >
> > >     if (CheckpointerShmem->checkpointer_pid == 0 ||
> > >         (CheckpointerShmem->num_requests >= CheckpointerShmem->max_requests &&
> > >          !CompactCheckpointerRequestQueue()))
> > >     {
> > >         /*
> > >          * Count the subset of writes where backends have to do their own
> > >          * fsync
> > >          */
> > >         if (!AmBackgroundWriterProcess())
> > >             CheckpointerShmem->num_backend_fsync++;
> > >         LWLockRelease(CheckpointerCommLock);
> > >         return false;
> > >     }
> > >
> > > One fix is to add code similar to the below in RegisterSyncRequest(),
> > > to trigger a checkpoint for this scenario.
> > >
> > >     // checkpointer_triggered: variable for one trigger only.
> > >     if (!ret && retryOnError && ProcGlobal->checkpointerLatch &&
> > >         !checkpointer_triggered)
> > >         SetLatch(ProcGlobal->checkpointerLatch);
> > >
> > > Any comments?
> >
> > It looks like you intended to set the checkpointer_triggered var but
>
> Yes, this is just pseudocode.
>
> > didn't. Also this will wake up the checkpointer but won't force a
> > checkpoint (unlike RequestCheckpoint()). It may be a good thing
>
> I do not expect an immediate checkpoint. AbsorbSyncRequests()
> is enough, since after that RegisterSyncRequest() can finish.
>
> > though as it would only absorb the requests and go back to sleep if no
> > other threshold is reached. Apart from the implementation details it
> > seems like it could help in this unlikely event.

Also note that ForwardSyncRequest() does wake up the checkpointer if it thinks the request queue in shared memory is "too full", but does not wake it up when the queue is actually full. This does not seem reasonable. See the code below in ForwardSyncRequest():

    /* If queue is more than half full, nudge the checkpointer to empty it */
    too_full = (CheckpointerShmem->num_requests >=
                CheckpointerShmem->max_requests / 2);

    /* ... but not till after we release the lock */
    if (too_full && ProcGlobal->checkpointerLatch)
        SetLatch(ProcGlobal->checkpointerLatch);

-- Paul Guo (Vmware)
Re: sync request forward function ForwardSyncRequest() might hang for some time in a corner case?
> You said "trigger a checkpoint", which sounded more like forcing a > checkpointer rather than waking up the checkpointer so that it can > absorb the pending requests, so it seems worth to mention what it > would really do. Yes it is not accurate. Thanks for the clarification. -- Paul Guo (Vmware)
Re: sync request forward function ForwardSyncRequest() might hang for some time in a corner case?
On Thu, May 27, 2021 at 10:22 PM Julien Rouhaud wrote:
>
> On Thu, May 27, 2021 at 10:05 PM Paul Guo wrote:
> >
> > Also note that ForwardSyncRequest() does wake up the checkpointer if
> > it thinks the request queue in shared memory is "too full", but does
> > not wake it up when the queue is actually full. This does not seem
> > reasonable. See the code below in ForwardSyncRequest():
> >
> >     /* If queue is more than half full, nudge the checkpointer to empty it */
> >     too_full = (CheckpointerShmem->num_requests >=
> >                 CheckpointerShmem->max_requests / 2);
> >
> >     /* ... but not till after we release the lock */
> >     if (too_full && ProcGlobal->checkpointerLatch)
> >         SetLatch(ProcGlobal->checkpointerLatch);
>
> Ah indeed. Well it means that the checkpointer is woken up early
> enough to avoid reaching that point. I'm not sure that it's actually
> possible to reach a point where the list is full and the checkpointer
> is sitting idle.

In theory this is possible (when the system is under heavy parallel writes); otherwise we could remove that part of the code (CompactCheckpointerRequestQueue()) :-), though the chance is not high. If we do encounter this issue, the affected queries would suddenly hang until the next checkpointer wakeup.
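To summarize the fix discussed in this thread, here is a minimal sketch against RegisterSyncRequest() in src/backend/storage/sync/sync.c (a sketch only, not a tested patch; checkpointer_nudged is a hypothetical variable, and the local pendingOps fast path is elided unchanged):

    bool
    RegisterSyncRequest(const FileTag *ftag, SyncRequestType type,
                        bool retryOnError)
    {
        bool    ret;
        bool    checkpointer_nudged = false;

        for (;;)
        {
            ret = ForwardSyncRequest(ftag, type);

            /* Done, or told not to retry on failure. */
            if (ret || !retryOnError)
                break;

            /*
             * The request queue is full and could not be compacted.  Nudge
             * the checkpointer's latch once so it absorbs the queue, instead
             * of waiting for the next checkpoint trigger.
             */
            if (!checkpointer_nudged && ProcGlobal->checkpointerLatch)
            {
                SetLatch(ProcGlobal->checkpointerLatch);
                checkpointer_nudged = true;
            }

            pg_usleep(10000L);      /* then retry, as the existing loop does */
        }

        return ret;
    }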
Re: Two patches to speed up pg_rewind.
> On 2021/2/19, 10:33 AM, "Paul Guo" wrote: > Refactored the code a bit along with fixes. Manually tested them on centos > & Ubuntu (the later has copy_file_range()) > For the first patch, actually I have some concerns. My assumption is that > the target pg_data directory should be fsync-ed already. This should be > correct normally but there is one scenario: a cleanly-shutdown database’s > pgdata directory was copied to another directory, in this case the new pgdata > is not fsync-ed - I’m not sure if that exists in real production environment > or not, > but even considering this we could still use the optimization for the case > that > calls ensureCleanShutdown() since this ensures a pgdata fsync on the target. Did some small modification and rebased the code. See attached for the new version. v3-0001-Fsync-the-affected-files-directories-only-in-pg_r.patch Description: v3-0001-Fsync-the-affected-files-directories-only-in-pg_r.patch v3-0002-Use-copy_file_range-for-file-copying-in-pg_rewind.patch Description: v3-0002-Use-copy_file_range-for-file-copying-in-pg_rewind.patch
Should wal receiver reply to wal sender more aggressively?
While working on some related issues I found that the wal receiver runs the walrcv_receive() loop before replying with the write/flush/apply LSNs to the wal sender in XLogWalRcvSendReply(). It is possible that the walrcv_receive() loop receives and writes a lot of xlog, so it does not reply with that LSN information in time, which finally slows down those transactions due to syncrep wait (assuming the default synchronous_commit).

In my TPCB testing, I found the worst case was that 10,466,469 bytes were consumed in the walrcv_receive() loop. More seriously, we call XLogWalRcvSendReply(false, false) after handling those bytes; the first argument false means no force, i.e. it does not notify the wal sender unless the wal_receiver_status_interval GUC value (10s by default) has elapsed, so we may have to wait for other calls of XLogWalRcvSendReply() to notify the wal sender.

I thought about and tried enhancing this by force-replying to the wal sender each time a maximum number of bytes (e.g. 128K) is received, but several things confused me:

- What's the purpose of the wal_receiver_status_interval GUC? The OS kernel is usually not efficient when handling small packets, but we are not replying that aggressively, so why is this GUC there?
- I ran a simple TPCB test (1000 scaling, 200 connections, shared_buffers and max_connections tuned) but found no obvious performance difference with and without the code change. I did not see an obvious system (IO/CPU/network) bottleneck - probably the bottleneck is in PG itself. I did not investigate further at this moment, but the change should in theory help, no?

Another thing that came to my mind is the wal receiver logic: currently the wal receiver process does network IO, wal write and wal flush in one process. Network IO is async, blocking at epoll/poll; wal write is mostly non-blocking; but for wal flush, we could probably decouple it into a dedicated process. And maybe use sync_file_range() instead of a wal file fsync in issue_xlog_fsync()? We should sync the wal contents with lower LSNs first and reply to the wal sender in time, right?

Below is the related code:

    /* See if we can read data immediately */
    len = walrcv_receive(wrconn, &buf, &wait_fd);
    if (len != 0)
    {
        /*
         * Process the received data, and any subsequent data we
         * can read without blocking.
         */
        for (;;)
        {
            if (len > 0)
            {
                /*
                 * Something was received from primary, so reset
                 * timeout
                 */
                last_recv_timestamp = GetCurrentTimestamp();
                ping_sent = false;
                XLogWalRcvProcessMsg(buf[0], &buf[1], len - 1);
            }
            else if (len == 0)
                break;
            else if (len < 0)
            {
                ereport(LOG,
                        (errmsg("replication terminated by primary server"),
                         errdetail("End of WAL reached on timeline %u at %X/%X.",
                                   startpointTLI,
                                   LSN_FORMAT_ARGS(LogstreamResult.Write))));
                endofwal = true;
                break;
            }
            len = walrcv_receive(wrconn, &buf, &wait_fd);
        }

        /* Let the primary know that we received some data. */
        XLogWalRcvSendReply(false, false);

        /*
         * If we've written some records, flush them to disk and
         * let the startup process and primary server know about
         * them.
         */
        XLogWalRcvFlush(false);

-- Paul Guo (Vmware)
Re: Should wal receiver reply to wal sender more aggressively?
[Resending the mail since I found my previous email had a very bad format that was hard to read.]

While working on some related issues I found that the wal receiver runs the walrcv_receive() loop before replying with the write/flush/apply LSNs to the wal sender in XLogWalRcvSendReply(). It is possible that the walrcv_receive() loop receives and writes a lot of xlog, so it does not reply with that LSN information in time, which finally slows down the transactions due to syncrep wait (assuming the default synchronous_commit).

During TPCB testing, I found the worst case was that 10,466,469 bytes were consumed in the walrcv_receive() loop. More seriously, we call XLogWalRcvSendReply(false, false) after handling those bytes; the first argument false means no force, i.e. it does not notify the wal sender unless the wal_receiver_status_interval GUC value (10s by default) has elapsed, so we may have to wait for other calls of XLogWalRcvSendReply() to notify the wal sender.

I thought about and tried enhancing this by force-flushing-and-replying each time a maximum number of bytes (e.g. 128K) is received (see the sketch after this message), but several things confused me:

- What's the purpose of the wal_receiver_status_interval GUC? The OS kernel is usually not efficient when handling small packets, but we are not replying that aggressively, so why is this GUC there?
- I ran a simple TPCB test (1000 scaling, 200 connections, shared_buffers and max_connections tuned) but found no obvious performance difference with and without the code change. I did not see an obvious system (IO/CPU/network) bottleneck - probably the bottleneck is in PG itself? I did not investigate further at this moment, but the change should in theory help, right? I may continue investigating, but probably won't do so unless I have some clear answers to these confusions.

Another thing that came to my mind is the wal receiver logic: currently the wal receiver process does network IO, wal write and wal flush in one process. Network IO is async, blocking at epoll/poll, etc.; wal write is mostly non-blocking; but for wal flush, could we probably decouple it into a dedicated process? And maybe use sync_file_range() instead of a wal file fsync in issue_xlog_fsync()? We should sync the wal contents with lower LSNs first and reply to the wal sender in time, right?

Below is the related code:

    /* See if we can read data immediately */
    len = walrcv_receive(wrconn, &buf, &wait_fd);
    if (len != 0)
    {
        /*
         * Process the received data, and any subsequent data we
         * can read without blocking.
         */
        for (;;)
        {
            if (len > 0)
            {
                /*
                 * Something was received from primary, so reset
                 * timeout
                 */
                last_recv_timestamp = GetCurrentTimestamp();
                ping_sent = false;
                XLogWalRcvProcessMsg(buf[0], &buf[1], len - 1);
            }
            else if (len == 0)
                break;
            else if (len < 0)
            {
                ereport(LOG,
                        (errmsg("replication terminated by primary server"),
                         errdetail("End of WAL reached on timeline %u at %X/%X.",
                                   startpointTLI,
                                   LSN_FORMAT_ARGS(LogstreamResult.Write))));
                endofwal = true;
                break;
            }
            len = walrcv_receive(wrconn, &buf, &wait_fd);
        }

        /* Let the primary know that we received some data. */
        XLogWalRcvSendReply(false, false);

        /*
         * If we've written some records, flush them to disk and
         * let the startup process and primary server know about
         * them.
         */
        XLogWalRcvFlush(false);
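For illustration, a minimal sketch of the force-reply idea against the receive loop above (a sketch only; WALRCV_REPLY_THRESHOLD and bytes_since_reply are hypothetical names, and 128K is just the example threshold mentioned above):

    /* In the walreceiver, near the loop shown above. */
    #define WALRCV_REPLY_THRESHOLD  (128 * 1024)    /* hypothetical */

        static ssize_t  bytes_since_reply = 0;      /* hypothetical */

        if (len > 0)
        {
            last_recv_timestamp = GetCurrentTimestamp();
            ping_sent = false;
            XLogWalRcvProcessMsg(buf[0], &buf[1], len - 1);

            /*
             * Force a flush and a status reply once enough bytes have been
             * consumed, instead of waiting for the receive loop to drain.
             */
            bytes_since_reply += len;
            if (bytes_since_reply >= WALRCV_REPLY_THRESHOLD)
            {
                XLogWalRcvFlush(false);             /* flush what we wrote */
                XLogWalRcvSendReply(true, false);   /* force = true */
                bytes_since_reply = 0;
            }
        }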
Re: Two patches to speed up pg_rewind.
No worries, I'm working on this.

On 2021/6/17, 3:18 PM, "Michael Paquier" wrote:

    On Wed, Jun 02, 2021 at 05:02:10PM +0900, Michael Paquier wrote:
    > On Wed, Jun 02, 2021 at 06:20:30PM +1200, Thomas Munro wrote:
    > > The main thing I noticed was that Linux < 5.3 can fail with EXDEV if
    > > you cross a filesystem boundary, is that something we need to worry
    > > about there?
    >
    > Hmm. Good point. That may justify having a switch to control that.

    Paul, the patch set still needs some work, so I am switching it as
    waiting on author. I am pretty sure that we had better have a
    fallback implementation of copy_file_range() in src/port/, and that we
    are going to need an extra switch in pg_rewind to allow users to
    bypass copy_file_range()/EXDEV if they do a local rewind operation
    across different FSes with a kernel < 5.3.
    --
    Michael
Re: Two patches to speed up pg_rewind.
On Thu, Jun 17, 2021 at 3:19 PM Michael Paquier wrote:
>
> On Wed, Jun 02, 2021 at 05:02:10PM +0900, Michael Paquier wrote:
> > On Wed, Jun 02, 2021 at 06:20:30PM +1200, Thomas Munro wrote:
> > > The main thing I noticed was that Linux < 5.3 can fail with EXDEV if
> > > you cross a filesystem boundary, is that something we need to worry
> > > about there?
> >
> > Hmm. Good point. That may justify having a switch to control that.
>
> Paul, the patch set still needs some work, so I am switching it as
> waiting on author. I am pretty sure that we had better have a
> fallback implementation of copy_file_range() in src/port/, and that we
> are going to need an extra switch in pg_rewind to allow users to
> bypass copy_file_range()/EXDEV if they do a local rewind operation
> across different FSes with a kernel < 5.3.
> --

I modified the copy_file_range() patch yesterday by simply falling back to read()+write(), but I think it could be improved further.

We may add a function to determine whether two files/paths are copy_file_range()-capable (using the POSIX-standard statvfs() f_fsid?) - that could be used by other copy_file_range() users, although in the long run the function would not be needed. And even with this we may still need the fallback code. (A sketch of such a check follows below.)

- For pg_rewind, we may just determine that capability once on the src/dst pgdata, but since there might be soft links (tablespace/wal) in pgdata, we should still allow fallback for copying those files that are not copy_file_range()-capable.
- Also, it seems that sometimes copy_file_range() can return ENOTSUP/EOPNOTSUPP (the file system does not support it and the kernel does not fall back to simple copying?), although this is not documented and seems unusual.

Any ideas?
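For illustration, a minimal sketch of such a capability check (a hypothetical helper; note that the semantics of f_fsid are not fully portable across platforms, so this would need verification):

    #include <sys/statvfs.h>
    #include <stdbool.h>

    /*
     * Return true if both paths appear to live on the same filesystem, in
     * which case copy_file_range() should not fail with EXDEV even on
     * Linux < 5.3.  Be conservative on any error.
     */
    static bool
    same_filesystem(const char *path1, const char *path2)
    {
        struct statvfs sv1;
        struct statvfs sv2;

        if (statvfs(path1, &sv1) != 0 || statvfs(path2, &sv2) != 0)
            return false;

        return sv1.f_fsid == sv2.f_fsid;
    }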
A cost issue in ORDER BY + LIMIT
Hello,

Postgres seems to always optimize ORDER BY + LIMIT as a top-k sort. Recently I happened to notice that in this scenario the output tuple count reported for the sort node is not the same as the LIMIT tuple count. See below:

postgres=# explain analyze verbose select * from t1 order by a limit 10;

                                                         QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=69446.17..69446.20 rows=10 width=4) (actual time=282.896..282.923 rows=10 loops=1)
   Output: a
   ->  Sort  (cost=69446.17..71925.83 rows=991862 width=4) (actual time=282.894..282.896 rows=10 loops=1)
         Output: a
         Sort Key: t1.a
         Sort Method: top-N heapsort  Memory: 25kB
         ->  Seq Scan on public.t1  (cost=0.00..44649.62 rows=991862 width=4) (actual time=0.026..195.438 rows=1001000 loops=1)
               Output: a
 Planning Time: 0.553 ms
 Execution Time: 282.961 ms
(10 rows)

We can see from the output that the LIMIT cost is also wrong because of this, since it assumes the input_rows from the sort node is 991862 (per gdb debugging). This could easily be fixed by the patch below:

diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index fb28e6411a..800cf0b256 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -2429,7 +2429,11 @@ cost_sort(Path *path, PlannerInfo *root,
 
 	startup_cost += input_cost;
 
-	path->rows = tuples;
+	if (limit_tuples > 0 && limit_tuples < tuples)
+		path->rows = limit_tuples;
+	else
+		path->rows = tuples;
+
 	path->startup_cost = startup_cost;
 	path->total_cost = startup_cost + run_cost;
 }

With the patch, the explain output looks like this:

postgres=# explain analyze verbose select * from t1 order by a limit 10;

                                                         QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=69446.17..71925.83 rows=10 width=4) (actual time=256.204..256.207 rows=10 loops=1)
   Output: a
   ->  Sort  (cost=69446.17..71925.83 rows=10 width=4) (actual time=256.202..256.203 rows=10 loops=1)
         Output: a
         Sort Key: t1.a
         Sort Method: top-N heapsort  Memory: 25kB
         ->  Seq Scan on public.t1  (cost=0.00..44649.62 rows=991862 width=4) (actual time=1.014..169.509 rows=1001000 loops=1)
               Output: a
 Planning Time: 0.076 ms
 Execution Time: 256.232 ms
(10 rows)

Regards,
Paul
Re: Two patches to speed up pg_rewind.
On Tue, Jun 22, 2021 at 11:08 AM Paul Guo wrote:
>
> On Thu, Jun 17, 2021 at 3:19 PM Michael Paquier wrote:
> >
> > On Wed, Jun 02, 2021 at 05:02:10PM +0900, Michael Paquier wrote:
> > > On Wed, Jun 02, 2021 at 06:20:30PM +1200, Thomas Munro wrote:
> > > > The main thing I noticed was that Linux < 5.3 can fail with EXDEV if
> > > > you cross a filesystem boundary, is that something we need to worry
> > > > about there?
> > >
> > > Hmm. Good point. That may justify having a switch to control that.
> >
> > Paul, the patch set still needs some work, so I am switching it as
> > waiting on author. I am pretty sure that we had better have a
> > fallback implementation of copy_file_range() in src/port/, and that we
> > are going to need an extra switch in pg_rewind to allow users to
> > bypass copy_file_range()/EXDEV if they do a local rewind operation
> > across different FSes with a kernel < 5.3.
> > --
>
> I modified the copy_file_range() patch yesterday by simply falling back
> to read()+write(), but I think it could be improved further.
>
> We may add a function to determine whether two files/paths are
> copy_file_range()-capable (using the POSIX-standard statvfs() f_fsid?) -
> that could be used by other copy_file_range() users, although in the
> long run the function would not be needed. And even with this we may
> still need the fallback code.
>
> - For pg_rewind, we may just determine that capability once on the
> src/dst pgdata, but since there might be soft links (tablespace/wal)
> in pgdata, we should still allow fallback for copying those files that
> are not copy_file_range()-capable.
> - Also, it seems that sometimes copy_file_range() can return
> ENOTSUP/EOPNOTSUPP (the file system does not support it and the kernel
> does not fall back to simple copying?), although this is not documented
> and seems unusual.
>
> Any ideas?

I modified the copy_file_range() patch using the logic below: if the first call of copy_file_range() fails with errno EXDEV or ENOTSUP, pg_rewind will not use copy_file_range() for the rest of the run, and whenever copy_file_range() fails we fall back to the previous read()+write() logic for that file. (A sketch of this logic follows.)

v4-0001-Fsync-the-affected-files-directories-only-in-pg_r.patch Description: Binary data

v4-0002-Use-copy_file_range-if-possible-for-file-copying-.patch Description: Binary data
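For illustration, a minimal sketch of that fallback logic (a sketch only; names are illustrative, and the real patch loops until the whole file is copied):

    #define _GNU_SOURCE
    #include <unistd.h>
    #include <errno.h>
    #include <stdbool.h>

    /* Disabled for the rest of the run after the first EXDEV/ENOTSUP. */
    static bool copy_file_range_ok = true;

    /* Copy one file, falling back to read()+write() on failure. */
    static void
    copy_one_file(int srcfd, int dstfd, size_t size)
    {
        if (copy_file_range_ok)
        {
            ssize_t nbytes = copy_file_range(srcfd, NULL, dstfd, NULL, size, 0);

            if (nbytes >= 0)
                return;                         /* real code loops until done */

            if (errno == EXDEV || errno == ENOTSUP)
                copy_file_range_ok = false;     /* don't try again this run */
            /* fall through: copy this file with read()+write() */
        }

        /* ... plain read()+write() fallback ... */
    }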
Re: standby recovery fails (tablespace related) (tentative patch and discussion)
Rebased. v12-0002-Tests-to-replay-create-database-operation-on-sta.patch Description: v12-0002-Tests-to-replay-create-database-operation-on-sta.patch v12-0001-Support-node-initialization-from-backup-with-tab.patch Description: v12-0001-Support-node-initialization-from-backup-with-tab.patch v12-0003-Fix-replay-of-create-database-records-on-standby.patch Description: v12-0003-Fix-replay-of-create-database-records-on-standby.patch v12-0004-Fix-database-create-drop-wal-description.patch Description: v12-0004-Fix-database-create-drop-wal-description.patch
Re: standby recovery fails (tablespace related) (tentative patch and discussion)
On Wed, Aug 11, 2021 at 4:56 AM Robert Haas wrote:
>
> On Thu, Aug 5, 2021 at 6:20 AM Paul Guo wrote:
> > Rebased.
>
> The commit message for 0001 is not clear enough for me to understand
> what problem it's supposed to be fixing. The code comments aren't
> really either. They make it sound like there's some problem with
> copying symlinks but mostly they just talk about callbacks, which
> doesn't really help me understand what problem we'd have if we just
> didn't commit this (or reverted it later).

Thanks for reviewing. Let me explain a bit. The patch series includes four patches; 0001 and 0002 are test changes for the fix (0003).

- 0001 is the test framework change that is needed by 0002.
- 0002 is the test for the code fix (0003).

0003 is the code change, and its commit message explains the issue in detail. 0004, as said, is a small enhancement that is somewhat independent of the previous patches.

Basically, the issue is that without the fix, tablespace-related crash recovery might fail. Here is the log after I run the tests in 0001/0002 without the 0003 fix:

2021-08-04 10:00:42.231 CST [875] FATAL: could not create directory "pg_tblspc/16385/PG_15_202107261/16390": No such file or directory
2021-08-04 10:00:42.231 CST [875] CONTEXT: WAL redo at 0/3001320 for Database/CREATE: copy dir base/1 to pg_tblspc/16385/PG_15_202107261/16390

> I am not really convinced by Álvaro's claim that 0004 is a "fix"; I
> think I'd call it an improvement. But either way I agree that could
> just be committed.
>
> I haven't analyzed 0002 and 0003 yet.
>
> --
> Robert Haas
> EDB: http://www.enterprisedb.com

-- Paul Guo (Vmware)
Re: Two patches to speed up pg_rewind.
Thanks for reviewing; please see the replies below.

On Tue, Aug 17, 2021 at 3:47 PM Michael Paquier wrote:
>
> On Thu, Aug 05, 2021 at 06:18:03PM +0800, Paul Guo wrote:
> > I modified the copy_file_range() patch using the below logic:
> >
> > If the first call of copy_file_range() fails with errno EXDEV or
> > ENOTSUP, pg_rewind would not use copy_file_range() in the rest of the
> > run, and if copy_file_range() fails we fall back to the previous
> > read()+write() code logic for the file.
>
> I have looked at 0001, and I don't really like it. One argument
> against this approach is that if pg_rewind fails in the middle of its
> operation then we would have done a set of fsync() for nothing, with
> the data folder still unusable. I would be curious to see some
> numbers to see how much it matters with many physical files (say cases
> with thousands of small relations?).
>
> +/* Define PG_FLUSH_DATA_WORKS if we have an implementation for pg_flush_data */
> +#if defined(HAVE_SYNC_FILE_RANGE)
> +#define PG_FLUSH_DATA_WORKS 1
> +#elif !defined(WIN32) && defined(MS_ASYNC)
> +#define PG_FLUSH_DATA_WORKS 1
> +#elif defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED)
> +#define PG_FLUSH_DATA_WORKS 1
>
> This is wrong for the code frontend on platforms that may finish by
> using MS_ASYNC, no? There is no such implementation in file_utils.c
> but there is one in fd.c.

Yes, it seems that we need to add the MS_ASYNC code (see the one in fd.c) to src/common/file_utils.c:pre_sync_fname().

> + fsync_fname("global/pg_control", false);
> + fsync_fname("backup_label", false);
> + if (access("recovery.conf", F_OK) == 0)
> +     fsync_fname("recovery.conf", false);
> + if (access("postgresql.auto.conf", F_OK) == 0)
> +     fsync_fname("postgresql.auto.conf", false);
>
> This list is wrong on various aspects, no? This would miss custom
> configuration files, or included files.

I did not understand this; can you please clarify? Anyway, let me explain: here we fsync these files additionally because pg_rewind (possibly) modifies them after rewinding. These files may not be handled/logged in the filemap:

- pg_control's action is FILE_ACTION_NONE;
- backup_label is excluded;
- recovery.conf is not logged in the filemap;
- postgresql.auto.conf may be logged, but let's fsync it anyway for safety.

> - if (showprogress)
> -     pg_log_info("syncing target data directory");
> - sync_target_dir();
> -
>   /* Also update the standby configuration, if requested. */
>   if (writerecoveryconf && !dry_run)
>       WriteRecoveryConfig(conn, datadir_target,
>                           GenerateRecoveryConfig(conn, NULL));
>
> + if (showprogress)
> +     pg_log_info("syncing target data directory");
> + perform_sync(filemap);
>
> Why inverting the order here?

We need perform_sync() to also synchronize the recovery configuration change, so it has to run last.

> + * Pre Linux 5.3 does not allow cross-fs copy_file_range() call
> + * (return EXDEV). Some fs do not support copy_file_range() (return
> + * ENOTSUP). Here we explicitly disable copy_file_range() for the
> + * two scenarios. For other failures we still allow subsequent
> + * copy_file_range() try.
> + */
> + if (errno == ENOTSUP || errno == EXDEV)
> +     copy_file_range_support = false;
>
> Are you sure that it is right to always cancel the support of
> copy_file_range() after it does not work once? Couldn't it be
> possible that things work properly depending on the tablespace being
> worked on by pg_rewind?

Ideally we should retry when we first run into a symlink (e.g. tablespace, wal), but it does not seem easy to do that gracefully.
> Having the facility for copy_file_range() in pg_rewind is not nice at > the end, and we are going to need a run-time check to fallback > dynamically to an equivalent implementation on errno={EXDEV,ENOTSUP}. > Hmm. What about locating all that in file_utils.c instead, with a > brand new routine name (pg_copy_file_range would be one choice)? We > still need the ./configure check, except that the conditions to use > the fallback implementation is in this routine, aka fallback on EXDEV, > ENOTSUP or !HAVE_COPY_FILE_RANGE. The backend may need to solve this > problem at some point, but logging and fd handling will likely just > locate that in fd.c, so having one routine for the purpose of all > frontends looks like a step in the right direction. Yes, seems better to make it generic.
Re: standby recovery fails (tablespace related) (tentative patch and discussion)
Looks like my previous reply was held for moderation (maybe due to my new email address). I configured my pg account today using the new email address; I guess this email will be held for moderation too. I'm now replying to my previous reply email and attaching the new patch series.

On Jul 6, 2020, at 10:18 AM, Paul Guo <gu...@vmware.com> wrote:

Thanks for the review. I'm now picking the work up again. I modified the code following the comments. Besides, I tweaked the test code a bit. There are several things I'm not 100% sure about; please see my replies below.

On Jan 27, 2020, at 11:24 PM, Fujii Masao <masao.fu...@oss.nttdata.com> wrote:

On 2020/01/15 19:18, Paul Guo wrote: I further fixed the last test failure (due to a small bug in the test, not in code). Attached are the new patch series. Let's see the CI pipeline result.

Thanks for updating the patches! I started reading the 0003 patch. The approach that the 0003 patch uses is not the perfect solution. If the standby crashes after tblspc_redo() removes the directory and before its subsequent COMMIT record is replayed, a PANIC error would occur since there can be some unresolved missing directory entries when we reach the consistent state. The problem would very rarely happen, though... Just an idea; calling XLogFlush() to update the minimum recovery point just before tblspc_redo() performs destroy_tablespace_directories() may be safe and helpful for the problem?

Yes, that looks like an issue. My understanding of the scenario is below:

    XLogLogMissingDir()
    XLogFlush() in redo (e.g. in a commit redo)  <- creates a minimum recovery point (call it LSN_A)
    tblspc_redo() -> XLogForgetMissingDir()      <- suppose we PANIC immediately after removing the
                                                    directory in tblspc_redo(); when we then replay
                                                    during crash recovery, we check consistency at
                                                    LSN_A and thus PANIC in XLogCheckMissingDirs()
    commit

We should add an XLogFlush() in tblspc_redo(). This brings several other questions to my mind also:

1. Should we call XLogFlush() in dbase_redo() for XLOG_DBASE_DROP also? It calls both XLogDropDatabase() and XLogForgetMissingDir(), which seem to have this issue as well.
2. xact_redo_abort() calls DropRelationFiles() also. Why don't we call XLogFlush() there?

(A sketch of the tblspc_redo() change follows at the end of this message.)

- appendStringInfo(buf, "copy dir %u/%u to %u/%u",
- xlrec->src_tablespace_id, xlrec->src_db_id,
- xlrec->tablespace_id, xlrec->db_id);
+ dbpath1 = GetDatabasePath(xlrec->src_db_id, xlrec->src_tablespace_id);
+ dbpath2 = GetDatabasePath(xlrec->db_id, xlrec->tablespace_id);
+ appendStringInfo(buf, "copy dir %s to %s", dbpath1, dbpath2);
+ pfree(dbpath2);
+ pfree(dbpath1);

If the patch is for the bug fix and would be back-ported, the above change would lead to a change in pg_waldump's output for CREATE/DROP DATABASE between minor versions. IMO it's better to avoid such a change and separate the above into a patch only for master.

I know we do not want to change the WAL format between minor releases, but does a change in the WAL description string between minor releases affect users? Anyway, I'll extract this part into a separate patch in the series, since this change is actually independent of the other changes.

- appendStringInfo(buf, " %u/%u",
- xlrec->tablespace_ids[i], xlrec->db_id);
+ {
+ dbpath1 = GetDatabasePath(xlrec->db_id, xlrec->tablespace_ids[i]);
+ appendStringInfo(buf, "%s", dbpath1);
+ pfree(dbpath1);
+ }

Same as above. BTW, the above "%s" should be " %s", i.e., a space character needs to be appended to the head of "%s".
OK.

+ get_parent_directory(parent_path);
+ if (!(stat(parent_path, &st) == 0 && S_ISDIR(st.st_mode)))
+ {
+ XLogLogMissingDir(xlrec->tablespace_id, InvalidOid, dst_path);

The third argument of XLogLogMissingDir() should be parent_path instead of dst_path?

The argument is for debug message printing, so both should be fine, but admittedly we are logging for the tablespace directory, so parent_path might be better.

+ if (hash_search(missing_dir_tab, &key, HASH_REMOVE, NULL) == NULL)
+ elog(DEBUG2, "dir %s tablespace %d database %d is not missing",
+ path, spcNode, dbNode);

I think that this elog() is useless and rather confusing.

OK. Modified.

+ XLogForgetMissingDir(xlrec->ts_id, InvalidOid, "");

The third argument should be set to the actual path instead of an empty string. Otherwise XLogForgetMissingDir() may emit a confusing DEBUG2 message. Or the third argument of XLogForgetMissingDir() should be removed and the path in the DEBUG2 message should be calculated from the spcNode and dbNode in the hash entry in XLogForgetMissingDir().

I'm now removing the third argument and using GetDatabasePath() to get the path if the database id is not InvalidOid.

+#include "common/file_perm.h"

This seems not necessary.

Right.

v10-0001-Support-node-initialization-from-backu
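For illustration, a minimal sketch of the tblspc_redo() change discussed above (a sketch only, assuming record->EndRecPtr is the right LSN to advance the minimum recovery point to):

    else if (info == XLOG_TBLSPC_DROP)
    {
        xl_tblspc_drop_rec *xlrec = (xl_tblspc_drop_rec *) XLogRecGetData(record);

        /*
         * Advance the minimum recovery point before removing the directory,
         * so a crash after the removal cannot replay from an LSN at which
         * the directory is still expected to exist when we reach the
         * consistency check.
         */
        XLogFlush(record->EndRecPtr);

        if (!destroy_tablespace_directories(xlrec->ts_id, true))
        {
            /* existing retry/error handling unchanged */
        }
    }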
Two fsync related performance issues?
Hello hackers,

1. StartupXLOG() does fsync on the whole data directory early in crash recovery, and I'm wondering if we could skip some directories (at least pg_log/ and the table directories), since WAL replay, etc. could ensure consistency. Here is the related code:

    if (ControlFile->state != DB_SHUTDOWNED &&
        ControlFile->state != DB_SHUTDOWNED_IN_RECOVERY)
    {
        RemoveTempXlogFiles();
        SyncDataDirectory();
    }

I have this concern since I saw an issue in a real production environment where the startup process needed 10+ seconds to start WAL replay after a relaunch due to elog(PANIC) (it was seen on the Postgres-based product Greenplum, but it is a common issue in Postgres also). I highly suspect the delay was mostly due to this. Also, fsync is noticeably slower on public clouds than on local storage, so the slowness should be more severe there. If we at least skipped fsync on the table directories, we could avoid a lot of file fsyncs - this may save many seconds during crash recovery.

2. CheckPointTwoPhase()

This may be a small issue. See the code below:

    for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
        RecreateTwoPhaseFile(gxact->xid, buf, len);

RecreateTwoPhaseFile() writes a state file for a prepared transaction and does fsync. It might be better to write all the files first and then fsync them in one pass, given that the kernel can flush asynchronously while we write the remaining file contents. If TwoPhaseState->numPrepXacts is large, we could do batching to avoid the fd resource limit. (A sketch of this idea follows.)

I did not test these yet, but this should be able to speed up checkpoint/restartpoint a bit.

Any thoughts?

Regards.
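For illustration, a minimal sketch of the write-all-then-fsync-all idea for the two-phase state files (a sketch only; the function and parameter names are hypothetical, not the actual PostgreSQL functions, and error handling is elided):

    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    /* Hypothetical batched variant of the per-file write+fsync loop. */
    static void
    recreate_twophase_files_batched(int nxacts, const char *paths[],
                                    const char *bufs[], const size_t lens[])
    {
        int    *fds = malloc(nxacts * sizeof(int));

        /* Pass 1: write everything; the kernel may flush asynchronously. */
        for (int i = 0; i < nxacts; i++)
        {
            fds[i] = open(paths[i], O_CREAT | O_TRUNC | O_WRONLY, 0600);
            (void) write(fds[i], bufs[i], lens[i]);
        }

        /* Pass 2: one fsync sweep, then close. */
        for (int i = 0; i < nxacts; i++)
        {
            (void) fsync(fds[i]);
            close(fds[i]);
        }

        free(fds);
    }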
Re: Two fsync related performance issues?
Thanks for the replies.

On Tue, May 12, 2020 at 2:04 PM Michael Paquier wrote:
> On Tue, May 12, 2020 at 12:55:37PM +0900, Fujii Masao wrote:
> > On 2020/05/12 9:42, Paul Guo wrote:
> >> 1. StartupXLOG() does fsync on the whole data directory early in
> >> the crash recovery. I'm wondering if we could skip some
> >> directories (at least the pg_log/, table directories) since wal,
> >> etc could ensure consistency.
> >
> > I agree that we can skip log directory but I'm not sure if skipping
> > table directory is really safe. Also ISTM that we can skip the directories
> > that those contents are removed or zeroed during recovery,
> > for example, pg_snapshots, pg_substrans, etc.
>
> Basically excludeDirContents[] as of basebackup.c.

Table directory and WAL fsync probably dominate the fsync time. Do we know of any real scenario that requires table directory fsync?
Re: Batch insert in CTAS/MatView code
ect a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z,a1,a2,a3,a4 from t12;
14.2      14.46    -1.83%   create table tt as select * from t13;
11.88     12.05    -1.43%   create table tt as select a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z,a1,a2,a3,a4,a5,a6 from t13;
3.17      3.25     -2.52%   create table tt as select * from t14;
2.93      3.12     -6.48%   create table tt as select a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y from t14;

scenario2: all related kernel caches are populated after the previous testing.

baseline  patch    diff%    SQL
9.6       4.97     48.23%   create table tt as select * from t11;
10.41     5.32     48.90%   create table tt as select a,b,c from t11;
9.12      9.52     -4.38%   create table tt as select * from t12;
9.66      8.6      10.97%   create table tt as select a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z,a1,a2,a3,a4 from t12;
13.56     13.6     -0.30%   create table tt as select * from t13;
11.36     11.7     -2.99%   create table tt as select a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z,a1,a2,a3,a4,a5,a6 from t13;
3.08      3.13     -1.62%   create table tt as select * from t14;
2.95      3.03     -2.71%   create table tt as select a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y from t14;

From the above we can draw some tentative conclusions:

1. t11: For short tuples, batch insert improves performance a lot (40%+).
2. t12: For BufferHeapTupleTableSlot, the patch slows things down by 4.x%-6.x%, but for VirtualTupleTableSlot it improves them by 10.x%. Looking at execTuples.c, this seems quite relevant to the additional memory copy. VirtualTupleTableSlot appears to be more common than BufferHeapTupleTableSlot, so the current code should probably be fine for most real cases. Alternatively, we could decide on multi-insert according to the input slot type, but that seems ugly in code; or we could lower the threshold a bit more so that "create table tt as select * from t12;" also improves, although that hurts the VirtualTupleTableSlot case.
3. t13: The new code still uses single insert, so the difference should be small. I just wanted to see the regression even when we use "single insert".
4. For the toast case t14, the degradation is small; not a big deal.

By the way, did we try or think about allowing better prefetch (on Linux) for seqscans, i.e. POSIX_FADV_SEQUENTIAL in posix_fadvise(), to enlarge the kernel readahead window? Presumably this should help if sequential tuple handling is faster than the default kernel readahead setting. (A sketch follows after this message.)

v2 patch is attached.

On Thu, Mar 7, 2019 at 4:54 PM Heikki Linnakangas wrote:
> On 06/03/2019 22:06, Paul Guo wrote:
> > The patch also modifies heap_multi_insert() a bit to do a bit further
> > code-level optimization by using static memory, instead of using memory
> > context and dynamic allocation.
>
> If toasting is required, heap_prepare_insert() creates a palloc'd tuple.
> That is still leaked to the current memory context.
>
> Leaking into the current memory context is not a bad thing, because
> resetting a memory context is faster than doing a lot of pfree() calls.
> The callers just need to be prepared for that, and use a short-lived
> memory context.
>
> > By the way, while looking at the code, I noticed that there are 9 local
> > arrays with large length in toast_insert_or_update() which seems to be a
> > risk of stack overflow. Maybe we should put it as static or global.
>
> Hmm. We currently reserve 512 kB between the kernel's limit, and the
> limit we check in check_stack_depth(). See STACK_DEPTH_SLOP. Those
> arrays add up to 52800 bytes on a 64-bit machine, if I did my math
> right. So there's still a lot of headroom.
> I agree that it nevertheless seems a bit excessive, though.
>
> > With the patch,
> >
> > Time: 4728.142 ms (00:04.728)
> > Time: 14203.983 ms (00:14.204)
> > Time: 1008.669 ms (00:01.009)
> >
> > Baseline,
> > Time: 11096.146 ms (00:11.096)
> > Time: 13106.741 ms (00:13.107)
> > Time: 1100.174 ms (00:01.100)
>
> Nice speedup!
>
> > While for toast and large column size there is < 10% decrease but for
> > small column size the improvement is super good. Actually if I hardcode
> > the batch count as 4 all test cases are better but the improvement for
> > small column size is smaller than that with current patch. Pretty much
> > the number 4 is quite case specific so I can not hardcode that in the
> > patch. Of course we could further tune that but the current value seems
> > to be a good trade-off?
>
> Have you done any profiling, on why the multi-insert is slower with
> large tuples? In principle, I don't see why it should be slower.
>
> - Heikki

v2-0001-Heap-batch-insert-for-CTAS-MatView.patch Description: Binary data
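For illustration, a minimal sketch of the prefetch idea mentioned above (a sketch only; where exactly the hint would be issued, e.g. when the seqscan opens the underlying heap file, is an open question):

    #include <fcntl.h>

    /*
     * Hint the kernel that "fd" will be read sequentially, so it may
     * enlarge its readahead window.  The hint is advisory; a failure is
     * harmless and can be ignored.
     */
    static void
    hint_sequential(int fd)
    {
        (void) posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
    }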
Re: Batch insert in CTAS/MatView code
On Mon, Jun 17, 2019 at 8:53 PM Paul Guo wrote: > Hi all, > > I've been working other things until recently I restarted the work, > profiling & refactoring the code. > It's been a long time since the last patch was proposed. The new patch has > now been firstly refactored due to > 4da597edf1bae0cf0453b5ed6fc4347b6334dfe1 (Make TupleTableSlots extensible, > finish split of existing slot type). > > Now that TupleTableSlot, instead of HeapTuple is one argument of > intorel_receive() so we can not get the > tuple length directly. This patch now gets the tuple length if we know all > columns are with fixed widths, else > we calculate an avg. tuple length using the first MAX_MULTI_INSERT_SAMPLES > (defined as 1000) tuples > and use for the total length of tuples in a batch. > > I noticed that to do batch insert, we might need additional memory copy > sometimes comparing with "single insert" > (that should be the reason that we previously saw a bit regressions) so a > good solution seems to fall back > to "single insert" if the tuple length is larger than a threshold. I set > this as 2000 after quick testing. > > To make test stable and strict, I run checkpoint before each ctas, the > test script looks like this: > > checkpoint; > \timing > create table tt as select a,b,c from t11; > \timing > drop table tt; > > Also previously I just tested the BufferHeapTupleTableSlot (i.e. create > table tt as select * from t11), > this time I test VirtualTupleTableSlot (i.e. create table tt as select > a,b,c from t11) additionally. > It seems that VirtualTupleTableSlot is very common in real cases. > > I tested four kinds of tables, see below SQLs. > > -- tuples with small size. > create table t11 (a int, b int, c int, d int); > insert into t11 select s,s,s,s from generate_series(1, 1000) s; > analyze t11; > > -- tuples that are untoasted and tuple size is 1984 bytes. > create table t12 (a name, b name, c name, d name, e name, f name, g name, > h name, i name, j name, k name, l name, m name, n name, o name, p name, q > name, r name, s name, t name, u name, v name, w name, x name, y name, z > name, a1 name, a2 name, a3 name, a4 name, a5 name); > insert into t12 select 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', > 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', > 'z', 'a', 'b', 'c', 'd', 'e' from generate_series(1, 50); > analyze t12; > > -- tuples that are untoasted and tuple size is 2112 bytes. > create table t13 (a name, b name, c name, d name, e name, f name, g name, > h name, i name, j name, k name, l name, m name, n name, o name, p name, q > name, r name, s name, t name, u name, v name, w name, x name, y name, z > name, a1 name, a2 name, a3 name, a4 name, a5 name, a6 name, a7 name); > insert into t13 select 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', > 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', > 'z', 'a', 'b', 'c', 'd', 'e', 'f', 'g' from generate_series(1, 50); > analyze t13; > > -- tuples that are toastable and tuple compressed size is 1084. > create table t14 (a text, b text, c text, d text, e text, f text, g text, > h text, i text, j text, k text, l text, m text, n text, o text, p text, q > text, r text, s text, t text, u text, v text, w text, x text, y text, z > text); > insert into t14 select i, i, i, i, i, i, i, i, i, i, i, i, i, i, i, i, i, > i, i, i, i, i, i, i, i, i from (select repeat('123456789', 1) from > generate_series(1,5000)) i; > analyze t14; > > > I also tested two scenarios for each testing. 
> > One is to clean up all kernel caches (page & inode & dentry on Linux) > using the command below and then run the test, > sync; echo 3 > /proc/sys/vm/drop_caches > After running all tests all relation files will be in kernel cache (my > test system memory is large enough to accommodate all relation files), > then I run the tests again. I run like this because in real scenario the > result of the test should be among the two results. Also I rerun > each test and finally I calculate
Re: standby recovery fails (tablespace related) (tentative patch and discussion)
On Mon, May 27, 2019 at 9:39 PM Paul Guo wrote: > > > On Tue, May 14, 2019 at 11:06 AM Kyotaro HORIGUCHI < > horiguchi.kyot...@lab.ntt.co.jp> wrote: > >> Hello. >> >> At Mon, 13 May 2019 17:37:50 +0800, Paul Guo wrote in < >> caeet0zf9yn4daxyuflzocayyxuff1ms_oqwea+rwv3gha5q...@mail.gmail.com> >> > Thanks for the reply. >> > >> > On Tue, May 7, 2019 at 2:47 PM Kyotaro HORIGUCHI < >> > horiguchi.kyot...@lab.ntt.co.jp> wrote: >> > >> > > >> > > + if (!(stat(parent_path, &st) == 0 && S_ISDIR(st.st_mode))) >> > > + { >> > > >> > > This patch is allowing missing source and destination directory >> > > even in consistent state. I don't think it is safe. >> > > >> > >> > I do not understand this. Can you elaborate? >> >> Suppose we were recoverying based on a backup at LSN1 targeting >> to LSN3 then it crashed at LSN2, where LSN1 < LSN2 <= LSN3. LSN2 >> is called as "consistency point", before where the database is >> not consistent. It's because we are applying WAL recored older >> than those that were already applied in the second trial. The >> same can be said for crash recovery, where LSN1 is the latest >> checkpoint ('s redo LSN) and LSN2=LSN3 is the crashed LSN. >> >> Creation of an existing directory or dropping of a non-existent >> directory are apparently inconsistent or "broken" so we should >> stop recovery when seeing such WAL records while database is in >> consistent state. >> > > This seems to be hard to detect. I thought using invalid_page mechanism > long ago, > but it seems to be hard to fully detect a dropped tablespace. > > > > > 2) Fixed dbase_desc(). Now the xlog output looks correct. >> > > > >> > > > rmgr: Databaselen (rec/tot): 42/42, tx:486, lsn: >> > > > 0/016386A8, prev 0/01638630, desc: CREATE copy dir base/1 to >> > > > pg_tblspc/16384/PG_12_201904281/16386 >> > > > >> > > > rmgr: Databaselen (rec/tot): 34/34, tx:487, lsn: >> > > > 0/01638EB8, prev 0/01638E40, desc: DROP dir >> > > > pg_tblspc/16384/PG_12_201904281/16386 >> > > >> > > WAL records don't convey such information. The previous >> > > description seems right to me. >> > > >> > >> > 2019-04-17 14:52:14.951 CST [23030] CONTEXT: WAL redo at 0/3011650 for >> > Database/CREATE: copy dir 1663/1 to 65546/65547 >> > The directories are definitely wrong and misleading. >> >> The original description is right in the light of how the server >> recognizes. The record exactly says that "copy dir 1663/1 to >> 65546/65547" and the latter path is converted in filesystem layer >> via a symlink. >> > > In either $PG_DATA/pg_tblspc or symlinked real tablespace directory, > there is an additional directory like PG_12_201905221 between > tablespace oid and database oid. See the directory layout as below, > so the directory info in xlog dump output was not correct. > > $ ls -lh data/pg_tblspc/ > > > total 0 > > > lrwxrwxrwx. 1 gpadmin gpadmin 6 May 27 17:23 16384 -> /tmp/2 > > > $ ls -lh /tmp/2 > > > total 0 > > > drwx--. 3 gpadmin gpadmin 18 May 27 17:24 PG_12_201905221 > >> >> >> > > > > Also I'd suggest we should use pg_mkdir_p() in >> > > TablespaceCreateDbspace() >> > > > > to replace >> > > > > the code block includes a lot of >> > > > > get_parent_directory(), MakePGDirectory(), etc even it >> > > > > is not fixing a bug since pg_mkdir_p() code change seems to be >> more >> > > > > graceful and simpler. >> > > >> > > But I don't agree to this. pg_mkdir_p goes above two-parents up, >> > > which would be unwanted here. >> > > >> > > I do not understand this also. pg_mkdir_p() is similar to 'mkdir -p'. 
>> > This change just makes the code concise, though in theory the change
>> > is not needed.
>>
>> We don't want to create a tablespace directory after a concurrent
>> DROP, as the comment just above says:
>>
>> | * Acquire TablespaceCreateLock to ensure that no DROP TABLESPACE
>> | * or TablespaceCreateDbspace is running concurrently.
>>
>> If the concurrent DROP TABLESPACE destroyed the grandparent
>> directory, we mustn't create it again.
>
> Yes, this is a good reason to keep the original code. Thanks.
>
> By the way, based on your previous test patch I added another test
> which can easily detect the missing-source-directory case.

I updated the patch to v3. In this version, we skip the error if copydir fails due to a missing src/dst directory, but to make sure the ignoring is legal, I added a simple log/forget mechanism (using a List), similar to the xlog invalid-page tracking mechanism; a sketch of the idea is below. Two TAP tests are included: one is actually from a previous patch by Kyotaro in this email thread, and the other was added by me. In addition, dbase_desc() is fixed to make the message accurate. Thanks.

v3-0001-skip-copydir-if-either-src-directory-or-dst-direc.patch Description: Binary data
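To make the log/forget idea concrete, here is a toy sketch in backend C. All names are illustrative, not the actual patch: skipped directory operations are remembered while the database is still inconsistent, forgotten when a later record accounts for them, and anything left pending at the consistency point is treated as corruption, mirroring the invalid-page table in xlogutils.c.

#include "postgres.h"
#include "nodes/pg_list.h"

/* Paths whose copy/removal was skipped before the consistency point. */
static List *skipped_dirs = NIL;

static void
remember_skipped_dir(const char *path)
{
	skipped_dirs = lappend(skipped_dirs, pstrdup(path));
}

/* Called when a later WAL record resolves the skipped directory. */
static void
forget_skipped_dir(const char *path)
{
	ListCell   *lc;

	foreach(lc, skipped_dirs)
	{
		if (strcmp((const char *) lfirst(lc), path) == 0)
		{
			skipped_dirs = foreach_delete_current(skipped_dirs, lc);
			break;
		}
	}
}

/* Called once the consistency point is reached. */
static void
check_skipped_dirs(void)
{
	if (skipped_dirs != NIL)
		elog(PANIC, "unresolved skipped directories at consistency point");
}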
Re: Two pg_rewind patches (auto generate recovery conf and ensure clean shutdown)
Rebased, aligned with recent changes in pg_rewind/pg_basebackup and then retested. Thanks.

On Mon, Jul 1, 2019 at 7:35 PM Thomas Munro wrote:
> On Fri, Apr 19, 2019 at 3:41 PM Paul Guo wrote:
> > I updated the patches as attached following previous discussions.
>
> Hi Paul,
>
> Could we please have a fresh rebase now that the CF is here?
>
> Thanks,
>
> --
> Thomas Munro
> https://enterprisedb.com

v3-0001-Extact-common-functions-from-pg_basebackup-into-s.patch Description: Binary data
v3-0002-Add-option-to-write-recovery-configuration-inform.patch Description: Binary data
v3-0003-Ensure-target-clean-shutdown-at-the-beginning-of-.patch Description: Binary data
Re: Two pg_rewind patches (auto generate recovery conf and ensure clean shutdown)
On Tue, Jul 2, 2019 at 12:35 AM Alvaro Herrera wrote:
> On 2019-Apr-19, Paul Guo wrote:
>
> > The below patch runs single-mode Postgres if needed to make sure the
> > target is cleanly shut down. A new option is added (off by default).
> > v2-0001-Ensure-target-clean-shutdown-at-beginning-of-pg_r.patch
>
> Why do we need an option for this? Is there a reason not to do this
> unconditionally?

There is concern about this (see previous emails in this thread). On Greenplum (an MPP database based on Postgres), we do this unconditionally. I'm not sure how Postgres users usually handle an unclean shutdown, but providing an option seems safer, to avoid breaking existing scripts or services. If many people think the option is unnecessary, I'm fine with removing it and keeping the code logic unconditional. (A rough sketch of the clean-shutdown logic is below.)
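For reference, a rough sketch of the approach, under the assumption that running the target cluster once in single-user mode completes crash recovery and leaves the control file in a cleanly-shut-down state. Details such as path quoting, locating the postgres binary, and error reporting are simplified here; MAXPGPATH and DEVNULL are the usual PostgreSQL header defines.

#include <stdio.h>
#include <stdlib.h>

/*
 * Run the target cluster once in single-user mode so crash recovery
 * completes before rewinding. Single-user mode exits as soon as it
 * reads EOF on stdin, hence the redirect from DEVNULL.
 */
static void
ensureCleanShutdown(const char *exec_path, const char *datadir)
{
	char		cmd[MAXPGPATH * 2];

	/* -F skips fsync; pg_rewind syncs the data directory afterwards */
	snprintf(cmd, sizeof(cmd),
			 "\"%s\" --single -F -D \"%s\" template1 < \"%s\"",
			 exec_path, datadir, DEVNULL);

	if (system(cmd) != 0)
	{
		fprintf(stderr, "single-user mode run of target cluster failed\n");
		exit(1);
	}
}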
Re: Two pg_rewind patches (auto generate recovery conf and ensure clean shutdown)
Yes, the patches changed the Makefile so that pg_rewind and pg_basebackup can share some common code, but for the Windows build, I'm not sure where those Windows build files are. Does anyone know about that? Thanks.

On Tue, Jul 9, 2019 at 6:55 AM Thomas Munro wrote:
> On Tue, Jul 2, 2019 at 5:46 PM Paul Guo wrote:
> > Rebased, aligned with recent changes in pg_rewind/pg_basebackup and
> > then retested. Thanks.
>
> Hi Paul,
>
> A minor build problem on Windows:
>
> src/bin/pg_rewind/pg_rewind.c(32): fatal error C1083: Cannot open
> include file: 'backup_common.h': No such file or directory
> [C:\projects\postgresql\pg_rewind.vcxproj]
>
> https://ci.appveyor.com/project/postgresql-cfbot/postgresql/build/1.0.46422
> http://cfbot.cputube.org/paul-guo.html
>
> --
> Thomas Munro
> https://enterprisedb.com
Re: Two pg_rewind patches (auto generate recovery conf and ensure clean shutdown)
On Wed, Jul 10, 2019 at 3:28 PM Michael Paquier wrote:
> On Tue, Jul 09, 2019 at 10:48:49PM +0800, Paul Guo wrote:
> > Yes, the patches changed the Makefile so that pg_rewind and pg_basebackup
> > can share some common code, but for the Windows build, I'm not sure where
> > those Windows build files are. Does anyone know about that? Thanks.
>
> The VS scripts are located in src/tools/msvc/. You will likely need
> to tweak things like $frontend_extraincludes or variables in the same
> area for this patch (please see Mkvcbuild.pm).

Thanks. Both Mkvcbuild.pm and pg_rewind/Makefile have been modified so that the Windows build passes in a local environment (hopefully it also passes the CI testing). In addition, pg_rewind/Makefile no longer creates a soft link for backup_common.h; instead, -I is used to specify the header directory. I also noticed that a documentation change is needed, so I updated the docs for the two new options accordingly. Please see the attached new patches.

v4-0002-Add-option-to-write-recovery-configuration-inform.patch Description: Binary data
v4-0003-Ensure-target-clean-shutdown-at-the-beginning-of-.patch Description: Binary data
v4-0001-Extact-common-functions-from-pg_basebackup-into-s.patch Description: Binary data
Re: standby recovery fails (tablespace related) (tentative patch and discussion)
On Mon, Jul 8, 2019 at 11:16 AM Thomas Munro wrote:
> On Wed, Jun 19, 2019 at 7:22 PM Paul Guo wrote:
> > I updated the patch to v3. In this version, we skip the error if copydir
> > fails due to a missing src/dst directory, but to make sure the ignoring is
> > legal, I added a simple log/forget mechanism (using a List), similar to the
> > xlog invalid-page tracking mechanism. Two TAP tests are included: one is
> > actually from a previous patch by Kyotaro in this email thread, and the
> > other was added by me. In addition, dbase_desc() is fixed to make the
> > message accurate.
>
> Hello Paul,
>
> FYI t/011_crash_recovery.pl is failing consistently on Travis CI with
> this patch applied:
>
> https://travis-ci.org/postgresql-cfbot/postgresql/builds/555368907

This failure is because the previous v3 patch does not align with a recent commit:

commit 660a2b19038b2f6b9f6bcb2c3297a47d5e3557a8
Author: Noah Misch
Date: Fri Jun 21 20:34:23 2019 -0700

    Consolidate methods for translating a Perl path to a Windows path.

My patch used TestLib::real_dir, which was replaced with TestLib::perl2host in the above commit. I've updated the patch to v4 to align my code. The test now passes in my local environment. Please see the attached v4 patch. Thanks.

v4-0001-skip-copydir-if-either-src-directory-or-dst-direc.patch Description: Binary data
How to reclaim the space of dropped columns of a table?
Hello hackers,

I have had this question for a while, with no answer from various sources. As is known, after dropping a column using ALTER TABLE, the table is not rewritten, and VACUUM FULL does not remove the dropped column either (it is still visible in pg_attribute). The PG documentation says:

https://www.postgresql.org/docs/current/sql-altertable.html
"To force immediate reclamation of space occupied by a dropped column, you can execute one of the forms of ALTER TABLE that performs a rewrite of the whole table. This results in reconstructing each row with the dropped column replaced by a null value."

This seems a bit vague for users (how does one rewrite the table while keeping its definition?), and it still keeps the dropped columns (though as nulls). Wouldn't it be better to leave this functionality to a command like VACUUM FULL, which could completely remove the dropped columns (i.e., no dropped-column entries in pg_attribute and no null values for dropped columns in the table)? Thanks.
Re: Possible race condition in pg_mkdir_p()?
On Thu, Jul 18, 2019 at 4:57 PM Michael Paquier wrote:
> On Thu, Jul 18, 2019 at 04:17:22PM +0800, Ning Yu wrote:
> > This is still wrong with the current code logic, because when the statusdir
> > is a file the errno is also EEXIST, but it can pass the check here. Even if
> > we fix pg_mkdir_p() by following the `mkdir -p` way, the errno check here
> > is still wrong.
>
> Would you like to send a patch?

Michael, we'll send out the patch later.

Checking the code, it seems there is another related mkdir() issue. MakePGDirectory() is actually the mkdir() syscall, and the manpage gives this meaning for the EEXIST errno:

    EEXIST pathname already exists (not necessarily as a directory).
    This includes the case where pathname is a symbolic link,
    dangling or not.

However, it looks like some callers do not handle that correctly, e.g.

    if (MakePGDirectory(directory) < 0)
    {
        if (errno == EEXIST)
            return;

or

    if (MakePGDirectory(parentdir) < 0 && errno != EEXIST)

i.e., rather than checking errno == EEXIST alone, we had better verify with stat() and S_ISDIR() that the existing path really is a directory; a sketch of such a check is below. One possible fix is to add an argument like ignore_created to MakePGDirectory() (in case some callers want to fail if the path has already been created) and put that logic into it.
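A minimal sketch of the stricter check. The helper name and the wrapper shape are illustrative, not committed code; it only assumes the existing backend function int MakePGDirectory(const char *).

#include <sys/stat.h>
#include <errno.h>
#include <stdbool.h>

/*
 * Create "path" if missing; treat EEXIST as success only when the
 * existing path really is a directory, since mkdir() also returns
 * EEXIST for a plain file or a symlink (dangling or not).
 */
static bool
MakePGDirectoryIfMissing(const char *path)
{
	struct stat st;

	if (MakePGDirectory(path) == 0)
		return true;

	if (errno == EEXIST &&
		stat(path, &st) == 0 && S_ISDIR(st.st_mode))
		return true;			/* already there as a directory: OK */

	/* a production version would preserve mkdir()'s errno here */
	return false;				/* file/symlink in the way, or other error */
}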
Re: pg_ugprade test failure on data set with column with default value with type bit/varbit
Thanks. I updated the patch as attached and double-checked that those tests pass.

2018-07-30 9:38 GMT+08:00 Thomas Munro:
> On Thu, May 17, 2018 at 8:20 PM, Paul Guo wrote:
> > Thanks. I tentatively submitted a patch (see the attachment).
>
> Hi Paul,
>
> It looks like you missed a couple of changes in the contrib/btree_gist
> bit and varbit tests, so make check-world fails:
>
> - Index Cond: ((a >= B'100'::"bit") AND (a <= B'101'::"bit"))
> + Index Cond: ((a >= '100'::"bit") AND (a <= '101'::"bit"))
>
> --
> Thomas Munro
> http://www.enterprisedb.com

0001-Fix-pg_upgrade-test-failure-caused-by-the-DDL-below.v2.patch Description: Binary data
[Patch] Create a new session in postmaster by calling setsid()
Hello,

Recently I encountered an issue during testing and root-caused it to what the title mentions. Postmaster children already create a new session by calling InitPostmasterChild(), but the postmaster itself does not. This can lead to some issues (e.g. signal handling). The test script below is a simple example.

$ cat test.sh
pg_ctl -D ./data -l stop.log stop
sleep 2   # wait for pg down
pg_ctl -D ./data -l start.log start
sleep 2   # wait for pg up
echo "pg_sleep()-ing"
psql -d postgres -c 'select pg_sleep(1000)'

Press Ctrl+C while pg_sleep() is running, and you will see the postmaster and its children are all gone.

Long ago PG had pmdaemonize() to daemonize the postmaster, which of course created a new session. That code was removed in the commit below.

commit f7ea6beaf4ca02b8e6dc576255e35a5b86035cb9
Author: Heikki Linnakangas
Date: Mon Jul 4 14:35:44 2011 +0300

    Remove silent_mode. You get the same functionality with "pg_ctl -l
    postmaster.log", or nohup. There was a small issue with LINUX_OOM_ADJ
    and silent_mode, namely that with silent_mode the postmaster process
    incorrectly used the OOM settings meant for backend processes. We
    certainly could've fixed that directly, but since silent_mode was
    redundant anyway, we might as well just remove it.

Here is the related discussion about that patch; it seems to be related to the OOM score setting in fork_process().
https://www.postgresql.org/message-id/4E046EA5.8070101%40enterprisedb.com

Personally I still like PG being a daemon, though there are probably real cases I do not know of where a daemon is not wanted. If we do not go back to pmdaemonize() with a further fix of fork_process() (assuming I understand correctly), we could simply call setsid() in the postmaster code to fix the issue above. I attached a simple patch. Thanks.

0001-Create-a-new-session-in-postmaster-by-calling-setsid.patch Description: Binary data
Re: [Patch] Create a new session in postmaster by calling setsid()
On Thu, Aug 2, 2018 at 10:30 PM, Tom Lane wrote:
> Paul Guo writes:
> > [ make the postmaster execute setsid() too ]
>
> I'm a bit skeptical of this proposal. Forcing the postmaster to
> dissociate from its controlling terminal is a good thing in some
> scenarios, but probably less good in others, and manual postmaster
> starts are probably mostly in the latter class.
>
> I wonder whether having "pg_ctl start" do a setsid() would accomplish
> the same result with less chance of breaking debugging usages.
>
> regards, tom lane

Yes, considering the case of starting the postmaster manually, we cannot create a new session in the postmaster itself, so pg_ctl seems to be a good place for the setsid() call. Attached is a newer patch; a rough sketch of the idea follows. Thanks.

0001-Create-a-new-session-for-postmaster-in-pg_ctl-by-cal.patch Description: Binary data
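A minimal sketch of the approach, assuming pg_ctl forks a child that execs the postmaster (the real change lives in pg_ctl's start_postmaster(), whose command building and error handling are more thorough):

#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <unistd.h>

/*
 * In the pg_ctl child that will become the postmaster, detach from the
 * controlling terminal first, so a Ctrl+C in the starting shell can no
 * longer signal the postmaster's process group.
 */
static int
start_postmaster_sketch(void)
{
	pid_t		pid = fork();

	if (pid < 0)
		return -1;				/* fork failed */
	if (pid == 0)				/* child: will exec the postmaster */
	{
		if (setsid() < 0)		/* new session, no controlling terminal */
		{
			fprintf(stderr, "could not create new session: %s\n",
					strerror(errno));
			_exit(1);
		}
		/* ... build the command line and exec the postmaster here ... */
	}
	return 0;					/* parent continues (waits for startup) */
}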
Re: Parallelize stream replication process
> On Sep 16, 2020, at 11:15 AM, Li Japin wrote:
>
>> On Sep 15, 2020, at 3:41 PM, Fujii Masao wrote:
>>
>> On 2020/09/15 13:41, Bharath Rupireddy wrote:
>>> On Tue, Sep 15, 2020 at 9:27 AM Li Japin wrote:
>>>> For now, postgres uses a single process to send, receive and replay
>>>> the WAL when we use streaming replication. Is there any point in
>>>> parallelizing this process? If so, how do we start? Any thoughts?
>>
>> Probably this is another parallelism than what you're thinking,
>> but I was thinking to start up the walwriter process in the standby
>> server and make it fsync the streamed WAL data. This means that we
>> leave a part of the tasks of the walreceiver process to the walwriter.
>> The walreceiver performs WAL receive and write, and the walwriter
>> performs WAL flush, in parallel. I'm just expecting that this change
>> would improve the replication performance, e.g., reduce the time to
>> wait for sync replication.

Yes, this should be able to improve that in theory (a toy sketch of this receive/flush split is below).

>> Without this change (i.e., originally), only the walreceiver performs
>> WAL receive, write and flush. So while the walreceiver is fsyncing WAL
>> data, it cannot receive newly-arrived WAL data. If the WAL flush takes
>> time, the time to wait for sync replication on the primary is enlarged.
>>
>> Regards,
>>
>> --
>> Fujii Masao
>> Advanced Computing Technology Center
>> Research and Development Headquarters
>> NTT DATA CORPORATION
>
> Yeah, this might be a direction.
>
> Now I am thinking about how to parallelize WAL replay. If we can improve
> the efficiency of replay, then we can shorten the database recovery time
> (streaming replication or database crash recovery).

Yes, parallelization should help the scenarios where CPU utilization is 100%, or where it is below 100% but replay keeps blocking on operations that could be improved by running them in parallel. There are a lot of scenarios in WAL replay (memory operations, system calls, locking, etc). I do not have experience with real production environments, so I'm not sure whether or how much recovery suffers, but I believe parallel recovery should help to accelerate failover, which is quite important especially in cloud environments. To do this we may need to carefully analyze the dependencies between various WAL records first. It won't be a small effort. I've heard some databases have implemented this, though I'm not sure how much it helps.

As for a parallel walreceiver/walsender, their functionality is simple, so after decoupling the fsync operation into a separate process I'm not sure how beneficial further parallelization would be (surely it is needed if the walreceiver/walsender is 100% CPU utilized).

> For streaming replication, we may need to improve the transmission of
> WAL logs to improve the entire recovery process.
>
> I'm not sure if this is correct.
>
> Regards,
> Japin Li.
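To make the proposed walreceiver/walwriter split concrete, here is a toy model in plain POSIX threads. This is not PostgreSQL code (in the actual proposal these would be separate processes coordinating through shared memory); it only illustrates the scheme: one side receives and writes WAL, the other fsyncs, so receiving never stalls behind a slow flush.

#include <pthread.h>
#include <unistd.h>
#include <sys/types.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  more = PTHREAD_COND_INITIALIZER;
static off_t written_upto = 0;	/* written but possibly unflushed */
static off_t flushed_upto = 0;
static int   wal_fd;			/* the current WAL segment file */

/* "walreceiver" side: call after each write() so the flusher can run */
static void
receiver_wrote(off_t upto)
{
	pthread_mutex_lock(&lock);
	written_upto = upto;
	pthread_cond_signal(&more);
	pthread_mutex_unlock(&lock);
	/* receiver immediately goes back to reading from the network */
}

/* "walwriter" side: fsync in a loop, off the receive path */
static void *
flusher(void *arg)
{
	(void) arg;
	for (;;)
	{
		off_t		target;

		pthread_mutex_lock(&lock);
		while (written_upto == flushed_upto)
			pthread_cond_wait(&more, &lock);
		target = written_upto;
		pthread_mutex_unlock(&lock);

		fsync(wal_fd);			/* the slow part */

		pthread_mutex_lock(&lock);
		flushed_upto = target;	/* this is what sync replication waits on */
		pthread_mutex_unlock(&lock);
	}
	return NULL;
}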
[PATCH] Use access() to check file existence in GetNewRelFileNode().
The previous code uses BasicOpenFile() + close(). access() should be faster than BasicOpenFile() + close(), and it should also be more correct, since BasicOpenFile() can fail for various reasons (e.g. file permissions) even when the file exists. access() is supported on Linux/Unix. I do not have a Windows dev environment, but MSDN tells me that access() is supported on Windows also, and there are already access() calls in the tree, so I assume there is no portability issue. A sketch of the check is below. Thanks.

0001-Use-access-to-check-file-existence-in-GetNewRelFileN.patch Description: Binary data
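A minimal sketch of the existence check as it would slot into the retry loop in GetNewRelFileNode(), which keeps picking OIDs until no collision is found. The helper name is illustrative and the exact handling in the patch may differ:

#include <unistd.h>
#include <errno.h>
#include <stdbool.h>

/* rpath is the candidate relation file path for the new relfilenode */
static bool
relfile_exists(const char *rpath)
{
	if (access(rpath, F_OK) == 0)
		return true;			/* file exists: pick another relfilenode */
	if (errno != ENOENT)
		return true;			/* unexpected failure: be conservative */
	return false;				/* definitely absent: the OID is usable */
}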
Re: pg_ugprade test failure on data set with column with default value with type bit/varbit
Thanks. I tentatively submitted a patch (see the attachment).

By the way, the current pg_upgrade test script depends on the data left behind in the test database, but it seems that a lot of tables are dropped in those test SQL files, which considerably reduces pg_upgrade's test coverage. Maybe this needs to be addressed in the future. (Maybe pg_upgrade needs to be considered whenever anyone checks in test cases?)

2018-05-11 3:08 GMT+08:00 Robert Haas:
> On Fri, Mar 30, 2018 at 5:36 AM, Paul Guo wrote:
> > There is no diff in functionality of the dump SQLs, but it is annoying.
> > The simple patch below could fix this. Thanks.
> >
> > --- a/src/backend/utils/adt/ruleutils.c
> > +++ b/src/backend/utils/adt/ruleutils.c
> > @@ -9389,7 +9389,7 @@ get_const_expr(Const *constval, deparse_context
> > *context, int showtype)
> >
> >         case BITOID:
> >         case VARBITOID:
> > -           appendStringInfo(buf, "B'%s'", extval);
> > +           appendStringInfo(buf, "'%s'", extval);
> >             break;
> >
> >         case BOOLOID:
>
> My first reaction was to guess that this would be unsafe, but looking
> it over I don't see a problem. For the other types handled in that
> switch statement, we rely on the custom representation to avoid
> needing a typecast, but it seems that for BITOID and VARBITOID we
> insert a typecast no matter what. So maybe the presence or absence of
> the "B" makes no difference.
>
> This logic seems to have been added by commit
> c828ec88205a232a9789f157d8cf9c3d82f85152, Peter Eisentraut, vintage
> 2002.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company

0001-Fix-pg_upgrade-test-failure-caused-by-the-DDL-below.patch Description: Binary data
Re: [PATCH] Use access() to check file existence in GetNewRelFileNode().
F_OK seems better than R_OK because we want to check file existence (not read permission) before creating the relation file at that path later.

2018-05-17 17:09 GMT+08:00 Michael Paquier:
> On Thu, May 17, 2018 at 04:09:27PM +0800, Paul Guo wrote:
> > Previous code uses BasicOpenFile() + close().
> >
> > access() should be faster than BasicOpenFile()+close() and access()
> > should be more correct since BasicOpenFile() could fail for various
> > cases (e.g. due to file permission, etc) even when the file exists.
>
> Failing because of file permissions would be correct. There have been
> cases in the past, particularly on Windows, where anti-virus software
> wildly scans files, causing EACCES at various points in the data folder.
>
> > access() is supported on Linux/Unix. I do not have a Windows dev
> > environment, but MSDN tells me that access() is supported on Windows
> > also, and there are already access() calls in the tree, so I assume
> > there is no portability issue.
>
> Yes, access() is spread already in the core code.
>
> -    fd = BasicOpenFile(rpath, O_RDONLY | PG_BINARY);
> -
> -    if (fd >= 0)
> +    if (access(rpath, F_OK) == 0)
>
> What you are looking for here is R_OK, no?
> --
> Michael