Re: Patch: Range Merge Join
Dear all, thanks for the feedback! We had a closer look at the previous patches and the CustomScan infrastructure.

Compared to the previous patch, we do not (directly) focus on joins with the overlap (&&) condition in this patch. Instead we consider joins with containment (@>) between a range and an element, and joins with conditions over scalars of the form "right.element BETWEEN left.start AND left.end", or more generally left.start <(=) right.element AND right.element <(=) left.end. We call such conditions range conditions, and they can be combined with equality conditions in the Range Merge Join (a schematic example is given at the end of this message).

The Range Merge Join can use (optional) equality conditions and one range condition of the form shown above. In this case the inputs are sorted first by the attributes used for equality, and then one input by the range (or by start in the case of scalars) and the other input by the element. The Range Merge Join is then a simple extension of the Merge Join that, in addition to the (optional) equality attributes, also uses the range condition in the merge join states. This is similar to an index nested-loop join with scalars for cases where the relation containing the element has an index on the equality attributes followed by the element. The Range Merge Join uses sorting and thus does not require such an index for this purpose, and it performs better.

The patch uses the optimizer estimates to decide whether the Range Merge Join is beneficial compared to other execution strategies; when no equality attributes are present, it becomes the only efficient option for the above range conditions. If a join contains multiple range conditions, then based on the estimates the most effective one is chosen for the Range Merge Join.

Although we do not directly focus on joins with the overlap (&&) condition between two ranges, we show in [1] that these joins can be evaluated using the union (UNION ALL) of two joins with a range condition, where, intuitively, one tests that the start of one input falls within the range of the other and vice versa. We evaluated this using regular (B-tree) indices, compared it to joins with the overlap (&&) condition using GiST, SP-GiST and others, and found that it performs better. The Range Merge Join would improve this further and would not require the creation of an index.

We did not consider an implementation as a CustomScan, as we feel the join is rather general, can be implemented as a small extension of the existing Merge Join, and would otherwise require a substantial duplication of the Merge Join code.

Kind regards, Thomas, Anton, Johann, Michael, Peter

[1] https://doi.org/10.1007/s00778-021-00692-3 (open access)

On Tue., 5 Oct. 2021 at 02:30, Jaime Casanova < jcasa...@systemguards.com.ec> wrote: > > On Mon, Oct 04, 2021 at 04:27:54PM -0500, Jaime Casanova wrote: > >> On Thu, Jun 10, 2021 at 07:14:32PM -0700, Jeff Davis wrote: > >> > > >> > I'll start with the reason I set the work down before: it did not work > >> > well with multiple join keys. That might be fine, but I also started > >> > thinking it was specialized enough that I wanted to look into doing it > >> > as an extension using the CustomScan mechanism. > >> > > >> > Do you have any solution to working better with multiple join keys? > And > >> > do you have thoughts on whether it would be a good candidate for the > >> > CustomScan extension mechanism, which would make it easier to > >> > experiment with? > >> > > >> > >> Hi, > >> > >> It seems this has been stalled since jun-2021.
I intend to mark this as RwF unless someone speaks up in the next hour or so. > >> > > Thomas wrote me: > > > Hi, > > > > I registered this patch for the commitfest in July. It had not been > reviewed and was moved to the next CF. I would still like to submit it. > > > > Regards, > > Thomas > > > > Just for clarification, RwF doesn't imply rejection of the patch. > Nevertheless, given that there has been no real review, I will mark this > patch as "Waiting on Author" and move it to the next CF. > > Meanwhile, cfbot (aka http://commitfest.cputube.org) says this doesn't > compile. Here is a little patch to fix the compilation errors; after > that it passes all tests in make check-world. > > Also attached is a rebased version of your patch with the fixes, so that the > cfbot entry turns green again. > > -- > Jaime Casanova > Director de Servicios Profesionales > SystemGuards - Consultores de PostgreSQL > postgres.patch Description: Binary data
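To make the join shapes from the patch description above concrete, here is a schematic example (all table, column and period-bound names are made up; the overlap rewrite assumes half-open [s, e) periods):

-- A range condition combined with an equality condition, the case the
-- Range Merge Join targets (hypothetical tables):
SELECT *
FROM meetings m
JOIN readings r
  ON m.room = r.room                              -- equality condition (optional)
 AND r.ts BETWEEN m.valid_from AND m.valid_to;    -- range condition

-- The same idea with a range type and containment:
--   ... ON m.room = r.room AND m.valid_period @> r.ts

-- An overlap (&&) join rewritten as a UNION ALL of two range-condition joins,
-- roughly as in [1]; s and e are the period bounds of each input:
SELECT * FROM a JOIN b ON b.s >= a.s AND b.s < a.e
UNION ALL
SELECT * FROM a JOIN b ON a.s >  b.s AND a.s < b.e;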
Re: Patch: Range Merge Join
Thank you for the feedback and sorry for the oversight. I fixed the bug and attached a new version of the patch. Kind Regards, Thomas On Wed., 17 Nov. 2021 at 15:03, Daniel Gustafsson < dan...@yesql.se> wrote: > This patch fails to compile due to an incorrect function name in an > assertion: > > nodeMergejoin.c:297:9: warning: implicit declaration of function > 'list_legth' is invalid in C99 [-Wimplicit-function-declaration] > Assert(list_legth(node->rangeclause) < 3); > ^ > > -- > Daniel Gustafsson https://vmware.com/ > > postgres-rmj_11-17-21.patch Description: Binary data
Re: [PATCH] Add `verify-system` sslmode to use system CA pool for server cert
On Tue, 25 Oct 2022 01:03:23 +0100, Jacob Champion said: > I'd like to try to get this conversation started again. To pique > interest I've attached a new version of 0001, which implements > `sslrootcert=system` instead as suggested upthread. In 0002 I went > further and switched the default sslmode to `verify-full` when using > the system CA roots, because I feel pretty strongly that anyone > interested in using public CA systems is also interested in verifying > hostnames. (Otherwise, why make the switch?) Yeah, I agree that not forcing verify-full when using system CAs is a giant foot-gun; many people will stop configuring as soon as it works. Is there any argument for not checking the hostname when using a CA pool for which literally anyone can create a cert that passes? Skipping the check makes sense for self-signed or "don't care" setups, since that still provides at least some protection against passive attacks, but if someone went out of their way to get a third-party-signed cert, it doesn't. One downside to this approach is that now one option changes the value of another option. For SSL mode (my rejected patch :-) ) that maybe makes a bit more sense. For users, what is more surprising: a foot-gun that sounds safe, or one option that overrides another? -- typedef struct me_s { char name[] = { "Thomas Habets" }; char email[] = { "tho...@habets.se" }; char kernel[]= { "Linux" }; char *pgpKey[] = { "http://www.habets.pp.se/pubkey.txt"; }; char pgp[] = { "9907 8698 8A24 F52F 1C2E 87F6 39A4 9EEA 460A 0169" }; char coolcmd[] = { "echo '. ./_&. ./_'>_;. ./_" }; } me_t;
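To spell out what that would look like from the client side, as I read 0001+0002 a connection string along these lines should both use the system CA pool and verify the host name (the host, database and user here are made up):

# hypothetical server name and credentials, for illustration only
psql "host=db.example.com dbname=app user=alice sslrootcert=system"
# equivalent, spelling out the sslmode that 0002 would make the default here:
psql "host=db.example.com dbname=app user=alice sslrootcert=system sslmode=verify-full"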
Re: [PATCH] Add `verify-system` sslmode to use system CA pool for server cert
On Mon, 31 Oct 2022 23:36:16 +, Jacob Champion said: >> I wanted to get feedback on the approach before wordsmithing too >> much. > > I've added this to tomorrow's CF [1]. Thomas, if you get (or already > have) a PG community username, I can add you as an author. Sweet. I just created an account with username `habets`. -- typedef struct me_s { char name[] = { "Thomas Habets" }; char email[] = { "tho...@habets.se" }; char kernel[]= { "Linux" }; char *pgpKey[] = { "http://www.habets.pp.se/pubkey.txt"; }; char pgp[] = { "9907 8698 8A24 F52F 1C2E 87F6 39A4 9EEA 460A 0169" }; char coolcmd[] = { "echo '. ./_&. ./_'>_;. ./_" }; } me_t;
Re: Patch: Range Merge Join
Thank you for the feedback. I removed the redundant comments from the patch and added this thread to the July CF [1]. Best Regards, Thomas Mannhart [1] https://commitfest.postgresql.org/33/3160/ On Thu., 10 June 2021 at 05:10, David Rowley < dgrowle...@gmail.com> wrote: > On Thu, 10 Jun 2021 at 03:05, Thomas wrote: > > We have implemented the Range Merge Join algorithm by extending the > > existing Merge Join to also support range conditions, i.e., BETWEEN-AND > > or @> (containment for range types). > > It shouldn't be a blocker for you, but just so you're aware, there was > a previous proposal for this in [1] and a patch in [2]. I've included > Jeff here just so he's aware of this. Jeff may wish to state his > intentions with his own patch. It's been a few years now. > > I only just glanced over the patch. I'd suggest getting rid of the /* > Thomas */ comments. We use git, so if you need an audit trail about > changes then you'll find it in git blame. If you have those for an > internal audit trail then you should consider using git. No committer > would commit those to PostgreSQL, so they might as well disappear. > > For further review, please add the patch to the July commitfest [3]. > We should be branching for pg15 sometime before the start of July. > There will be more focus on new patches around that time. Further > details in [4]. > > Also, I see this is your first post to this list, so welcome, and > thank you for the contribution. Also, just to set expectations: > patches like this almost always take a while to get into shape for > PostgreSQL. Please expect a lot of requests to change things. That's > fairly standard procedure. The process often drags on for months and > in some less common cases, years. > > David > > [1] > https://www.postgresql.org/message-id/flat/6227.1334559170%40sss.pgh.pa.us#82c771950ba486dec911923a5e91 > [2] > https://www.postgresql.org/message-id/flat/CAMp0ubfwAFFW3O_NgKqpRPmm56M4weTEXjprb2gP_NrDaEC4Eg%40mail.gmail.com > [3] https://commitfest.postgresql.org/33/ > [4] https://wiki.postgresql.org/wiki/CommitFest > postgres-rmj.patch Description: Binary data
Re: [PATCH] Add `verify-system` sslmode to use system CA pool for server cert
On Mon, 6 Sep 2021 20:47:37 +0100, Tom Lane said: > I'm confused by your description of this patch. AFAIK, OpenSSL verifies > against the system-wide CA pool by default. Why do we need to do > anything? Experimentally, no, it doesn't. Or if it does, it doesn't verify the CN/altnames of the cert. sslmode=require allows self-signed certs and name mismatches. verify-ca errors out if there is no ~/.postgresql/root.crt; verify-full too. It seems that currently postgresql verifies the name if and only if verify-full is used, and then only against the ~/.postgresql/root.crt CA file. But it could be that I missed a config option? -- typedef struct me_s { char name[] = { "Thomas Habets" }; char email[] = { "tho...@habets.se" }; char kernel[]= { "Linux" }; char *pgpKey[] = { "http://www.habets.pp.se/pubkey.txt"; }; char pgp[] = { "9907 8698 8A24 F52F 1C2E 87F6 39A4 9EEA 460A 0169" }; char coolcmd[] = { "echo '. ./_&. ./_'>_;. ./_" }; } me_t;
Re: [PATCH] Add `verify-system` sslmode to use system CA pool for server cert
On Tue, 7 Sep 2021 15:16:51 +0100, Andrew Dunstan said: > can't you specify a CA cert in the system's > CA store if necessary? e.g. on my Fedora system I think it's > /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt I could, but that seems more like a workaround, where I have to change things around if LetsEncrypt switches to another root (I believe they have in the past, but I'm not sure), or the server decides to switch from LetsEncrypt to something else. Then all clients need to update. Such a decision could actually be made by whoever runs the webserver, not the database, and the database just reuses the cert and gets a free ride for cert renewals. So in other words postgresql currently doesn't use the system CA store at all, and the workaround is to find the system store and point the client at it. I agree that is a workaround. If you think this is enough of a corner case that the workaround is acceptable, or that the added complexity of another sslmode setting isn't worth fixing this edge case, then I assume you have more knowledge about how postgres is used in the field than I do. But it's not just about today. I would hope that now, with LE, every user of SSL starts using "real" certs. Postgres default settings imply that most people who even enable SSL will verify neither the CA nor the name, which is a shame. -- typedef struct me_s { char name[] = { "Thomas Habets" }; char email[] = { "tho...@habets.se" }; char kernel[]= { "Linux" }; char *pgpKey[] = { "http://www.habets.pp.se/pubkey.txt"; }; char pgp[] = { "9907 8698 8A24 F52F 1C2E 87F6 39A4 9EEA 460A 0169" }; char coolcmd[] = { "echo '. ./_&. ./_'>_;. ./_" }; } me_t;
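For the record, the workaround being discussed amounts to pointing libpq at the distro's bundle explicitly, e.g. with the Fedora path Andrew mentioned (other distros keep it elsewhere; the host name is made up):

# illustration only: server name is hypothetical, CA bundle path is the Fedora one from above
psql "host=db.example.com sslmode=verify-full sslrootcert=/etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt"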
Re: [PATCH] Add `verify-system` sslmode to use system CA pool for server cert
On Tue, 28 Sep 2021 02:09:11 +0100, Bruce Momjian said: > I don't think public CA's are a good idea for complex setups since > they open the ability for an external party to create certificates that > are trusted by your server's CA, e.g., certificate authentication. I'm not arguing for, and in fact would argue against, public CA for client certs. So that's a separate issue. Note that mTLS prevents a MITM attack that exposes server data even if the server cert is compromised or re-issued, so if the install is using client certs (with a private CA) then the public CA for the server matters much less. You can end up at the wrong server, yes, and provide data via INSERT, but you can't steal or corrupt existing data. And you say for complex setups. Fair enough. But currently I'd say the default is wrong, and what should be the default is not configurable. -- typedef struct me_s { char name[] = { "Thomas Habets" }; char email[] = { "tho...@habets.se" }; char kernel[]= { "Linux" }; char *pgpKey[] = { "http://www.habets.pp.se/pubkey.txt"; }; char pgp[] = { "9907 8698 8A24 F52F 1C2E 87F6 39A4 9EEA 460A 0169" }; char coolcmd[] = { "echo '. ./_&. ./_'>_;. ./_" }; } me_t;
Re: killing perl2host
On Sun, Feb 20, 2022 at 10:51 AM Andrew Dunstan wrote: > On 2/19/22 16:34, Andres Freund wrote: > >> The first > >> removes perl2host completely and adjusts all the places that use it. > >> The second removes almost all the remaining msys special processing in > >> TAP tests. Very nice improvement. Thanks for sorting this out. I didn't enjoy playing "Wordle" with the build farm.
Re: Missed condition-variable wakeups on FreeBSD
On Sun, Feb 27, 2022 at 8:07 AM Tom Lane wrote: > I don't know much about how gdb interacts with kernel calls on > FreeBSD, but I speculate that the poll(2) call returns with EINTR > after gdb releases the process, and then things resume fine, Yeah, at least FreeBSD and macOS interrupt system calls when you send SIGSTOP + SIGCONT directly, or when gdb and lldb do something similar via ptrace. The same applies on Linux, except that Linux restarts the system call automatically (even though the man page has this in the list of system calls that never restart) so you don't usually notice (though you can see it with strace). It's not really clear to me what should happen, given that the language around restarts is all about signal handlers, and stop + cont are system magic, not signal handlers. Anyway... > suggesting that we lost an interrupt somewhere. So it's happening on an i386 kernel, with WAIT_USE_POLL (not WAIT_USE_KQUEUE), but only under the build farm script (not when you ran it manually in a loop)... hmmm.
Re: Checkpointer sync queue fills up / loops around pg_usleep() are bad
On Sun, Feb 27, 2022 at 10:29 PM Julien Rouhaud wrote: > On Sun, Feb 27, 2022 at 06:10:45PM +0900, Michael Paquier wrote: > > On Sat, Feb 26, 2022 at 01:39:42PM -0800, Andres Freund wrote: > > > I suspect the easiest is to just convert that usleep to a WaitLatch(). > > > That'd > > > require adding a new enum value to WaitEventTimeout in 14. Which probably > > > is > > > fine? > > > > We've added wait events in back-branches in the past, so this does not > > strike me as a problem as long as you add the new entry at the end of > > the enum, while keeping things ordered on HEAD. > > +1 +1 Sleeps like these are also really bad for ProcSignalBarrier, which I was expecting to be the impetus for fixing any remaining loops like this. With the attached, 027_stream_regress.pl drops from ~29.5s to ~19.6s on my FreeBSD workstation! It seems a little strange to introduce a new wait event that will very often appear into a stable branch, but ... it is actually telling the truth, so there is that. The sleep/poll loop in RegisterSyncRequest() may also have another problem. The comment explains that it was a deliberate choice not to do CHECK_FOR_INTERRUPTS() here, which may be debatable, but I don't think there's an excuse to ignore postmaster death in a loop that presumably becomes infinite if the checkpointer exits. I guess we could do: - pg_usleep(10000L); + WaitLatch(NULL, WL_EXIT_ON_PM_DEATH | WL_TIMEOUT, 10, WAIT_EVENT_SYNC_REQUEST); But... really, this should be waiting on a condition variable that the checkpointer broadcasts on when the queue goes from full to not full, no? Perhaps for master only? (A rough sketch of that follows at the end of this message.) From 23abdf95dea74b914dc56a46739a63ff86035f51 Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Mon, 28 Feb 2022 11:27:05 +1300 Subject: [PATCH] Wake up for latches in CheckpointWriteDelay(). The checkpointer shouldn't ignore its latch. Other backends may be waiting for it to drain the request queue. Hopefully real systems don't have a full queue often, but the condition is reached easily when shared buffers is very small. This involves defining a new wait event, which will appear in the pg_stat_activity view often due to spread checkpoints. Back-patch only to 14. Even though the problem exists in earlier branches too, it's hard to hit there. In 14 we stopped using signal handlers for latches on Linux, *BSD and macOS, which were previously hiding this problem by interrupting the sleep (though not reliably, as the signal could arrive before the sleep begins; precisely the problem latches address). Reported-by: Andres Freund Discussion: https://postgr.es/m/20220226213942.nb7uvb2pamyu26dj%40alap3.anarazel.de --- doc/src/sgml/monitoring.sgml| 4 src/backend/postmaster/checkpointer.c | 5 - src/backend/utils/activity/wait_event.c | 3 +++ src/include/utils/wait_event.h | 1 + 4 files changed, 12 insertions(+), 1 deletion(-) diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml index bf7625d988..9dc3978f35 100644 --- a/doc/src/sgml/monitoring.sgml +++ b/doc/src/sgml/monitoring.sgml @@ -2236,6 +2236,10 @@ postgres 27093 0.0 0.0 30096 2752 ?Ss 11:34 0:00 postgres: ser BaseBackupThrottle Waiting during base backup when throttling activity. + + CheckpointerWriteDelay + Waiting between writes while performing a checkpoint.
+ PgSleep Waiting due to a call to pg_sleep or diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c index 4488e3a443..9d9aad166e 100644 --- a/src/backend/postmaster/checkpointer.c +++ b/src/backend/postmaster/checkpointer.c @@ -726,7 +726,10 @@ CheckpointWriteDelay(int flags, double progress) * Checkpointer and bgwriter are no longer related so take the Big * Sleep. */ - pg_usleep(10L); + WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH | WL_TIMEOUT, + 100, + WAIT_EVENT_CHECKPOINT_WRITE_DELAY); + ResetLatch(MyLatch); } else if (--absorb_counter <= 0) { diff --git a/src/backend/utils/activity/wait_event.c b/src/backend/utils/activity/wait_event.c index 60972c3a75..0706e922b5 100644 --- a/src/backend/utils/activity/wait_event.c +++ b/src/backend/utils/activity/wait_event.c @@ -485,6 +485,9 @@ pgstat_get_wait_timeout(WaitEventTimeout w) case WAIT_EVENT_BASE_BACKUP_THROTTLE: event_name = "BaseBackupThrottle"; break; + case WAIT_EVENT_CHECKPOINT_WRITE_DELAY: + event_name = "CheckpointWriteDelay"; + break; case WAIT_EVENT_PG_SLEEP: event_name = "PgSleep"; break; diff --git a/src/include/utils/wait_event.h b/src/include/utils/wait_event.h index 395d325c5f..d0345c6b49 100644 --- a/src/include/utils/wait_event.h +++ b/src/include/utils/wait_event.h @@ -141,6
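Sketching the condition-variable idea mentioned above (this is not a patch: the syncRequestNotFullCV field on CheckpointerShmem is hypothetical, and the wait event name is just reused from the snippet earlier in this message):

	/* requester side, replacing the sleep/retry loop in RegisterSyncRequest() */
	for (;;)
	{
		if (ForwardSyncRequest(ftag, type))
			break;				/* request queued */
		/* queue was full: sleep until the checkpointer says it has drained it */
		/* syncRequestNotFullCV is a hypothetical field, not in the current tree */
		ConditionVariableSleep(&CheckpointerShmem->syncRequestNotFullCV,
							   WAIT_EVENT_SYNC_REQUEST);
	}
	ConditionVariableCancelSleep();

	/* checkpointer side, after absorbing requests and freeing up queue space */
	ConditionVariableBroadcast(&CheckpointerShmem->syncRequestNotFullCV);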
Re: Missed condition-variable wakeups on FreeBSD
On Sun, Feb 27, 2022 at 11:18 AM Melanie Plageman wrote: > How could it be that worker 2 is waiting on the build barrier in > PHJ_BUILD_HASHING_INNER and worker 1 and the leader are waiting on it > with it supposedly in PHJ_BUILD_HASHING_OUTER? That'd be consistent with a wakeup going missing, somehow. W2 was supposed to be released but somehow didn't get the message (until a random debugger-induced EINTR releases it and it sees all the state it needs to proceed in shared memory), meanwhile the other two got to the end of the next phase and are waiting for it.
Re: Missed condition-variable wakeups on FreeBSD
On Sun, Feb 27, 2022 at 9:44 AM Andres Freund wrote: > > (gdb) p debug_query_string > > $1 = 0x21873090 "select count(*) from simple r join simple s using (id);" > > (gdb) bt > > #0 _poll () at _poll.S:4 > > #1 0x21701361 in __thr_poll (fds=0x219dc170, nfds=2, timeout=-1) at > > /usr/src/lib/libthr/thread/thr_syscalls.c:338 > > #2 0x215eaf3f in poll (pfd=0x219dc170, nfds=2, timeout=-1) at > > /usr/src/lib/libc/sys/poll.c:47 > > #3 0x0097b0fd in WaitEventSetWaitBlock (set=, > > cur_timeout=-1, occurred_events=, nevents=) > > at latch.c:1171 > > The next time this happens / if you still have this open, perhaps it could be > worth checking if there's a byte in the self pipe? As a basic sanity check I'd like to see pfd[0], pfd[1] and the output of fstat -p PID to see unconsumed bytes in the pipe. For example "echo hello | sleep 60" shows "0* pipe ... 6 rw" for sleep, but "echo hello | more" shows "0* pipe ... 0 rw".
Re: Checkpointer sync queue fills up / loops around pg_usleep() are bad
On Mon, Feb 28, 2022 at 2:36 PM Andres Freund wrote: > On February 27, 2022 4:19:21 PM PST, Thomas Munro > wrote: > >It seems a little strange to introduce a new wait event that will very > >often appear into a stable branch, but ... it is actually telling the > >truth, so there is that. > > In the back branches it needs to be at the end of the enum - I assume you > intended that just to be for HEAD. Yeah. > I wonder whether in HEAD we shouldn't make that sleep duration be computed > from the calculation in IsOnSchedule... I might look into this. > >The sleep/poll loop in RegisterSyncRequest() may also have another > >problem. The comment explains that it was a deliberate choice not to > >do CHECK_FOR_INTERRUPTS() here, which may be debatable, but I don't > >think there's an excuse to ignore postmaster death in a loop that > >presumably becomes infinite if the checkpointer exits. I guess we > >could do: > > > >- pg_usleep(1L); > >+ WaitLatch(NULL, WL_EXIT_ON_PM_DEATH | WL_TIMEOUT, 10, > >WAIT_EVENT_SYNC_REQUEST); > > > >But... really, this should be waiting on a condition variable that the > >checkpointer broadcasts on when the queue goes from full to not full, > >no? Perhaps for master only? > > Looks worth improving, but yes, I'd not do it in the back branches. 0003 is a first attempt at that, for master only (on top of 0002 which is the minimal fix). This shaves another second off 027_stream_regress.pl on my workstation. The main thing I realised is that I needed to hold interrupts while waiting, which seems like it should go away with 'tombstone' files as discussed in other threads. That's not a new problem in this patch, it just looks more offensive to the eye when you spell it out, instead of hiding it with an unreported sleep/poll loop... > I do think it's worth giving that sleep a proper wait event though, even in > the back branches. I'm thinking that 0002 should be back-patched all the way, but 0001 could be limited to 14. From a9344bb2fb2a363bec4be526f87560cb212ca10b Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Mon, 28 Feb 2022 11:27:05 +1300 Subject: [PATCH v2 1/3] Wake up for latches in CheckpointWriteDelay(). The checkpointer shouldn't ignore its latch. Other backends may be waiting for it to drain the request queue. Hopefully real systems don't have a full queue often, but the condition is reached easily when shared buffers is very small. This involves defining a new wait event, which will appear in the pg_stat_activity view often due to spread checkpoints. Back-patch only to 14. Even though the problem exists in earlier branches too, it's hard to hit there. In 14 we stopped using signal handlers for latches on Linux, *BSD and macOS, which were previously hiding this problem by interrupting the sleep (though not reliably, as the signal could arrive before the sleep begins; precisely the problem latches address). Reported-by: Andres Freund Discussion: https://postgr.es/m/20220226213942.nb7uvb2pamyu26dj%40alap3.anarazel.de --- doc/src/sgml/monitoring.sgml| 4 src/backend/postmaster/checkpointer.c | 8 +++- src/backend/utils/activity/wait_event.c | 3 +++ src/include/utils/wait_event.h | 1 + 4 files changed, 15 insertions(+), 1 deletion(-) diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml index 9fb62fec8e..8620aaddc7 100644 --- a/doc/src/sgml/monitoring.sgml +++ b/doc/src/sgml/monitoring.sgml @@ -2235,6 +2235,10 @@ postgres 27093 0.0 0.0 30096 2752 ?Ss 11:34 0:00 postgres: ser BaseBackupThrottle Waiting during base backup when throttling activity. 
+ + CheckpointerWriteDelay + Waiting between writes while performing a checkpoint. + PgSleep Waiting due to a call to pg_sleep or diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c index 4488e3a443..a59c3cf020 100644 --- a/src/backend/postmaster/checkpointer.c +++ b/src/backend/postmaster/checkpointer.c @@ -484,6 +484,9 @@ CheckpointerMain(void) } ckpt_active = false; + + /* We may have received an interrupt during the checkpoint. */ + HandleCheckpointerInterrupts(); } /* Check for archive_timeout and switch xlog files if necessary. */ @@ -726,7 +729,10 @@ CheckpointWriteDelay(int flags, double progress) * Checkpointer and bgwriter are no longer related so take the Big * Sleep. */ - pg_usleep(10L); + WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH | WL_TIMEOUT, + 100, + WAIT_EVENT_CHECKPOINT_WRITE_DELAY); + ResetLatch(MyLatch); } else if (--absorb_counter <= 0) { diff --git a/src/backend/utils/activity/wait_event.c b/src/backend/utils/activity/wait_
Re: pg_tablespace_location() failure with allow_in_place_tablespaces
On Fri, Mar 4, 2022 at 10:04 PM Kyotaro Horiguchi wrote: > And I made a quick hack on do_pg_start_backup. And I found that > pg_basebackup copies in-place tablespaces under the *current > directory*, which is not ok at all:( Hmm. Which OS are you on? Looks OK here -- the "in place" tablespace gets copied as a directory under pg_tblspc, no symlink: postgres=# set allow_in_place_tablespaces = on; SET postgres=# create tablespace ts1 location ''; CREATE TABLESPACE postgres=# create table t (i int) tablespace ts1; CREATE TABLE postgres=# insert into t values (1), (2); INSERT 0 2 postgres=# create user replication replication; CREATE ROLE $ pg_basebackup --user replication -D pgdata2 $ ls -slaph pgdata/pg_tblspc/ total 4.0K 0 drwx-- 3 tmunro tmunro 19 Mar 4 23:16 ./ 4.0K drwx-- 19 tmunro tmunro 4.0K Mar 4 23:16 ../ 0 drwx-- 3 tmunro tmunro 29 Mar 4 23:16 16384/ $ ls -slaph pgdata2/pg_tblspc/ total 4.0K 0 drwx-- 3 tmunro tmunro 19 Mar 4 23:16 ./ 4.0K drwx-- 19 tmunro tmunro 4.0K Mar 4 23:16 ../ 0 drwx-- 3 tmunro tmunro 29 Mar 4 23:16 16384/ $ ls -slaph pgdata/pg_tblspc/16384/PG_15_202203031/5/ total 8.0K 0 drwx-- 2 tmunro tmunro 19 Mar 4 23:16 ./ 0 drwx-- 3 tmunro tmunro 15 Mar 4 23:16 ../ 8.0K -rw--- 1 tmunro tmunro 8.0K Mar 4 23:16 16385 $ ls -slaph pgdata2/pg_tblspc/16384/PG_15_202203031/5/ total 8.0K 0 drwx-- 2 tmunro tmunro 19 Mar 4 23:16 ./ 0 drwx-- 3 tmunro tmunro 15 Mar 4 23:16 ../ 8.0K -rw--- 1 tmunro tmunro 8.0K Mar 4 23:16 16385 The warning from readlink() while making the mapping file isn't ideal, and perhaps we should suppress that with something like the attached. Or does the missing map file entry break something on Windows? From 6f001ec46c5e5f6ffa8e103f2b0d711e6904b763 Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Fri, 4 Mar 2022 23:07:46 +1300 Subject: [PATCH] Fix warning in basebackup of in-place tablespaces. While building the tablespace map for a base backup, don't call readlink() on directories under pg_tblspc. --- src/backend/access/transam/xlog.c | 5 + 1 file changed, 5 insertions(+) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 0d2bd7a357..b92cc66afb 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -66,6 +66,7 @@ #include "catalog/pg_control.h" #include "catalog/pg_database.h" #include "common/controldata_utils.h" +#include "common/file_utils.h" #include "executor/instrument.h" #include "miscadmin.h" #include "pg_trace.h" @@ -8292,6 +8293,10 @@ do_pg_start_backup(const char *backupidstr, bool fast, TimeLineID *starttli_p, snprintf(fullpath, sizeof(fullpath), "pg_tblspc/%s", de->d_name); + /* Skip in-place tablespaces (testing use only) */ + if (get_dirent_type(fullpath, de, false, ERROR) == PGFILETYPE_DIR) +continue; + #if defined(HAVE_READLINK) || defined(WIN32) rllen = readlink(fullpath, linkpath, sizeof(linkpath)); if (rllen < 0) -- 2.30.2
Re: Regression tests failures on Windows Server 2019 - on master at commit # d816f366b
On Fri, Mar 4, 2022 at 10:37 PM Bharath Rupireddy wrote: > It looks like regression tests are failing on Windows Server 2019 [1]. > Ignore if it's reported elsewhere. It's broken since 46ab07ff "Clean up assorted failures under clang's -fsanitize=undefined checks.": https://github.com/postgres/postgres/commits/master I don't see what that patch has to do with the symptoms. It looks a bit like the cluster started by the "startcreate" step ("pg_ctl.exe start ...") is mysteriously no longer running when we get to the "test_pl" step ("Connection refused").
Re: Regression tests failures on Windows Server 2019 - on master at commit # d816f366b
On Sat, Mar 5, 2022 at 7:10 AM Andres Freund wrote: > Or perhaps pg_ctl ought to pass CREATE_NEW_PROCESS_GROUP to CreateProcess()? > The lack of a process group would explain why we're getting signalled on > ctrl-c... I thought that sounded promising, given that the recent Cirrus agent commit you pointed to says "Always kill spawned shell's process group to avoid pipe FD hangs", and given that we do something conceptually similar on Unix. It doesn't seem to help, though... https://cirrus-ci.com/task/5572163880091648
Re: Regression tests failures on Windows Server 2019 - on master at commit # d816f366b
On Sat, Mar 5, 2022 at 10:56 AM Andres Freund wrote: > I suspect one also needs the console detach thing. I tried adding DETACHED_PROCESS (which should be like calling FreeConsole() in the child process) and then I tried CREATE_NEW_CONSOLE instead, on top of CREATE_NEW_PROCESS_GROUP. Neither helped, though I lost the postmaster's output. Things I tried and their output are here: https://github.com/macdice/postgres/commits/hack3
Re: Regression tests failures on Windows Server 2019 - on master at commit # d816f366b
On Sat, Mar 5, 2022 at 4:19 PM Andres Freund wrote: > https://github.com/cirruslabs/cirrus-ci-agent/issues/218#issuecomment-1059657781 Oof. Nice detective work. > As indicated in the message above it, there's a workaround. Not sure if worth > committing, if they were to fix it in a few days? cfbot could be repaired by > just adding a repo environment variable of CIRRUS_AGENT_VERSION 1.73.2... > OTOH, committing and reverting a single line + comment is cheap. I vote for committing that workaround into the tree temporarily, because it's not just cfbot, it's also everyone's dev branches on Github + the official mirror that are red.
Re: pg_tablespace_location() failure with allow_in_place_tablespaces
On Tue, Mar 8, 2022 at 12:58 AM Michael Paquier wrote: > On Fri, Mar 04, 2022 at 11:26:43PM +1300, Thomas Munro wrote: > > + /* Skip in-place tablespaces (testing use only) */ > > + if (get_dirent_type(fullpath, de, false, ERROR) == PGFILETYPE_DIR) > > + continue; > > I saw the warning when testing base backups with in-place tablespaces > and it did not annoy me much, but, yes, that can be confusing. > > Junction points are directories, no? Are you sure that this works > correctly on WIN32? It seems to me that we'd better use readlink() > only for entries in pg_tlbspc/ that are PGFILETYPE_LNK on non-WIN32 > and pgwin32_is_junction() on WIN32. Thanks, you're right. Test on a Win10 VM. Here's a new version. From e231fd45e153279bfb2dd3441b2a34797c446289 Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Tue, 8 Mar 2022 10:25:59 +1300 Subject: [PATCH] Suppress pg_basebackup warning for in-place tablespaces. Reported-by: Kyotaro Horiguchi Reviewed-by: Michael Paquier Discussion: https://postgr.es/m/20220304.165449.1200020258723305904.horikyota.ntt%40gmail.com diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 0d2bd7a357..ab8be77bd2 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -66,6 +66,7 @@ #include "catalog/pg_control.h" #include "catalog/pg_database.h" #include "common/controldata_utils.h" +#include "common/file_utils.h" #include "executor/instrument.h" #include "miscadmin.h" #include "pg_trace.h" @@ -8292,6 +8293,20 @@ do_pg_start_backup(const char *backupidstr, bool fast, TimeLineID *starttli_p, snprintf(fullpath, sizeof(fullpath), "pg_tblspc/%s", de->d_name); + /* + * Skip anything that isn't a symlink/junction. For testing only, + * we sometimes use allow_in_place_tablespaces to create + * directories directly under pg_tblspc, which would produce + * spurious warnings below. + */ +#ifdef WIN32 + if (!pgwin32_is_junction(fullpath)) +continue; +#else + if (get_dirent_type(fullpath, de, false, ERROR) != PGFILETYPE_LNK) +continue; +#endif + #if defined(HAVE_READLINK) || defined(WIN32) rllen = readlink(fullpath, linkpath, sizeof(linkpath)); if (rllen < 0) -- 2.35.1
Re: pg_tablespace_location() failure with allow_in_place_tablespaces
On Tue, Mar 8, 2022 at 10:39 AM Thomas Munro wrote: > Test on a Win10 VM. Erm, "Tested" (as in, I tested), I meant to write...
Re: Adding CI to our tree
On Thu, Mar 10, 2022 at 9:37 AM Justin Pryzby wrote: > On Wed, Mar 09, 2022 at 10:12:54AM -0800, Andres Freund wrote: > > On 2022-03-09 11:47:23 -0600, Justin Pryzby wrote: > > > I'm curious what you think of this patch. > > > > > > It makes check-world on freebsd over 30% faster - saving 5min. > > > > That's nice! While -Og makes interactive debugging noticeably harder IME, > > it's > > not likely to be a large enough difference just for backtraces etc. > > Yeah. gcc(1) claims that -Og can improve debugging. Wow, I see the effect on Cirrus -- test_world ran in 8:55 instead of 12:43 when I tried (terrible absolute times, but fantastic improvement!). Hmm, on my local FreeBSD 13 box I saw 5:07 -> 5:03 with this change. My working theory had been that there is something bad happening in the I/O stack under concurrency making FreeBSD on Cirrus/GCP very slow (ie patterns to stall on slow cloud I/O waits, something I hope to dig into when post-freeze round tuits present themselves), but that doesn't gel with this huge improvement from noodling with optimiser details, and I don't know why I don't see something similar locally. I'm confused. Just BTW it's kinda funny that we say -ggdb for macOS and then we use lldb to debug cores in cores_backtrace.sh. I suppose it would be more correct to say -glldb, but doubt it matters much...
Re: Adding CI to our tree
On Thu, Mar 10, 2022 at 4:33 PM Andres Freund wrote: > On 2022-03-10 15:43:16 +1300, Thomas Munro wrote: > > I'm confused. > > The "terrible IO wait" thing was before we reduced the number of CPUs and > concurrent jobs. It makes sense to me that with just two CPUs we're CPU bound, > in which case -Og obviously can make a difference. Oh, duh, yeah, that makes sense when you put it that way and considering the CPU graph: -O0: https://cirrus-ci.com/task/4578631912521728 -Og: https://cirrus-ci.com/task/5239486182326272
Re: WIP: WAL prefetch (another approach)
On Fri, Mar 11, 2022 at 6:31 PM Thomas Munro wrote: > Thanks for your review of 0001! It gave me a few things to think > about and some good improvements. And just in case it's useful, here's what changed between v21 and v22.. diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c index 86a7b4c5c8..0d0c556b7c 100644 --- a/src/backend/access/transam/xlogreader.c +++ b/src/backend/access/transam/xlogreader.c @@ -90,8 +90,8 @@ XLogReaderSetDecodeBuffer(XLogReaderState *state, void *buffer, size_t size) state->decode_buffer = buffer; state->decode_buffer_size = size; - state->decode_buffer_head = buffer; state->decode_buffer_tail = buffer; + state->decode_buffer_head = buffer; } /* @@ -271,7 +271,7 @@ XLogBeginRead(XLogReaderState *state, XLogRecPtr RecPtr) /* * See if we can release the last record that was returned by - * XLogNextRecord(), to free up space. + * XLogNextRecord(), if any, to free up space. */ void XLogReleasePreviousRecord(XLogReaderState *state) @@ -283,16 +283,16 @@ XLogReleasePreviousRecord(XLogReaderState *state) /* * Remove it from the decoded record queue. It must be the oldest item -* decoded, decode_queue_tail. +* decoded, decode_queue_head. */ record = state->record; - Assert(record == state->decode_queue_tail); + Assert(record == state->decode_queue_head); state->record = NULL; - state->decode_queue_tail = record->next; + state->decode_queue_head = record->next; - /* It might also be the newest item decoded, decode_queue_head. */ - if (state->decode_queue_head == record) - state->decode_queue_head = NULL; + /* It might also be the newest item decoded, decode_queue_tail. */ + if (state->decode_queue_tail == record) + state->decode_queue_tail = NULL; /* Release the space. */ if (unlikely(record->oversized)) @@ -302,11 +302,11 @@ XLogReleasePreviousRecord(XLogReaderState *state) } else { - /* It must be the tail record in the decode buffer. */ - Assert(state->decode_buffer_tail == (char *) record); + /* It must be the head (oldest) record in the decode buffer. */ + Assert(state->decode_buffer_head == (char *) record); /* -* We need to update tail to point to the next record that is in the +* We need to update head to point to the next record that is in the * decode buffer, if any, being careful to skip oversized ones * (they're not in the decode buffer). */ @@ -316,8 +316,8 @@ XLogReleasePreviousRecord(XLogReaderState *state) if (record) { - /* Adjust tail to release space up to the next record. */ - state->decode_buffer_tail = (char *) record; + /* Adjust head to release space up to the next record. */ + state->decode_buffer_head = (char *) record; } else { @@ -327,8 +327,8 @@ XLogReleasePreviousRecord(XLogReaderState *state) * we'll keep overwriting the same piece of memory if we're not * doing any prefetching. */ - state->decode_buffer_tail = state->decode_buffer; state->decode_buffer_head = state->decode_buffer; + state->decode_buffer_tail = state->decode_buffer; } } } @@ -351,7 +351,7 @@ XLogNextRecord(XLogReaderState *state, char **errormsg) /* Release the last record returned by XLogNextRecord(). */ XLogReleasePreviousRecord(state); - if (state->decode_queue_tail == NULL) + if (state->decode_queue_head == NULL) { *errormsg = NULL; if (state->errormsg_deferred) @@ -376,7 +376,7 @@ XLogNextRecord(XLogReaderState *state, char **errormsg) * XLogRecXXX(xlogreader) macros, which work with the decoder rather than * the record for historical reasons. 
*/ - state->record = state->decode_queue_tail; + state->record = state->decode_queue_head; /* * Update the pointers to the beginning and one-past-the-end of this @@ -428,12 +428,12 @@ XLogReadRecord(XLogReaderState *state, char **errormsg) if (!XLogReaderHasQueuedRecordOrError(state)) XLogReadAhead(state, false /* nonblocking */ ); - /* Consume the tail record or error. */ + /* Consume the head record or error. */ decoded = XLogNextRecord(state, errormsg);
Re: pg_tablespace_location() failure with allow_in_place_tablespaces
On Tue, Mar 8, 2022 at 4:01 PM Kyotaro Horiguchi wrote: > At Tue, 8 Mar 2022 10:28:46 +0900, Michael Paquier > wrote in > > On Tue, Mar 08, 2022 at 10:06:50AM +0900, Kyotaro Horiguchi wrote: > > > At Tue, 8 Mar 2022 10:39:06 +1300, Thomas Munro > > > wrote in > > >> Thanks, you're right. Test on a Win10 VM. Here's a new version. > > > > Looks fine to me. > > > > > FYI, on Windows11, pg_basebackup didn't work correctly without the > > > patch. So this looks like fixing an undiscovered bug as well. > > > > Well, that's not really a long-time bug but just a side effect of > > in-place tablespaces because we don't use them in many test cases > > yet, is it? > > No, we don't. So just FYI. Ok, I pushed the fix for pg_basebackup. As for the complaint about pg_tablespace_location() failing, would it be better to return an empty string? That's what was passed in as LOCATION. Something like the attached. diff --git a/src/backend/utils/adt/misc.c b/src/backend/utils/adt/misc.c index 4568749d23..1cd70a8eaa 100644 --- a/src/backend/utils/adt/misc.c +++ b/src/backend/utils/adt/misc.c @@ -15,6 +15,7 @@ #include "postgres.h" #include +#include #include #include #include @@ -306,6 +307,24 @@ pg_tablespace_location(PG_FUNCTION_ARGS) */ snprintf(sourcepath, sizeof(sourcepath), "pg_tblspc/%u", tablespaceOid); + /* Return empty string for in-place tablespaces. */ +#ifdef WIN32 + if (!pgwin32_is_junction(sourcepath)) + PG_RETURN_TEXT_P(cstring_to_text("")); +#else + { + struct stat stat_buf; + + if (lstat(sourcepath, &stat_buf) < 0) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not stat file \"%s\": %m", + sourcepath))); + if (!S_ISLNK(stat_buf.st_mode)) + PG_RETURN_TEXT_P(cstring_to_text("")); + } +#endif + rllen = readlink(sourcepath, targetpath, sizeof(targetpath)); if (rllen < 0) ereport(ERROR,
Re: pg_tablespace_location() failure with allow_in_place_tablespaces
On Tue, Mar 15, 2022 at 2:33 PM Thomas Munro wrote: > As for the complaint about pg_tablespace_location() failing, would it > be better to return an empty string? That's what was passed in as > LOCATION. Something like the attached. (Hrrmm, the contract for pgwin32_is_junction() is a little weird: false means "success, but no" and also "failure, you should check errno". But we never do.)
Re: pg_tablespace_location() failure with allow_in_place_tablespaces
On Tue, Mar 15, 2022 at 2:50 PM Michael Paquier wrote: > On Tue, Mar 15, 2022 at 02:33:17PM +1300, Thomas Munro wrote: > > As for the complaint about pg_tablespace_location() failing, would it > > be better to return an empty string? That's what was passed in as > > LOCATION. Something like the attached. > > Hmm, I don't think so. The point of the function is to be able to > know the location of a tablespace at SQL level so as we don't have any > need to hardcode its location within any external tests (be it a > pg_regress test or a TAP test) based on how in-place tablespace paths > are built in the backend, so I think that we'd better report either a > relative path from data_directory or an absolute path, but not an > empty string. > > In any case, I'd suggest to add a regression test. What I have sent > upthread would be portable enough. Fair enough. No objections here.
Re: pg_tablespace_location() failure with allow_in_place_tablespaces
On Tue, Mar 15, 2022 at 10:30 PM Michael Paquier wrote: > So, which one of a relative path or an absolute path do you think > would be better for the user? My preference tends toward the relative > path, as we know that all those tablespaces stay in pg_tblspc/ so one > can make the difference with normal tablespaces more easily. The > barrier is thin, though :p Sounds good to me.
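For the archives, with the relative-path behaviour that looks something like this (the OID in the result is only illustrative):

SET allow_in_place_tablespaces = on;
CREATE TABLESPACE ts1 LOCATION '';
SELECT pg_tablespace_location(oid) FROM pg_tablespace WHERE spcname = 'ts1';
-- returns something like 'pg_tblspc/16385', relative to the data directory,
-- instead of failing in readlink() as before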
Re: Checkpointer sync queue fills up / loops around pg_usleep() are bad
On Wed, Mar 2, 2022 at 10:58 AM Andres Freund wrote: > On 2022-03-02 06:46:23 +1300, Thomas Munro wrote: > > From a9344bb2fb2a363bec4be526f87560cb212ca10b Mon Sep 17 00:00:00 2001 > > From: Thomas Munro > > Date: Mon, 28 Feb 2022 11:27:05 +1300 > > Subject: [PATCH v2 1/3] Wake up for latches in CheckpointWriteDelay(). > > LGTM. Would be nice to have this fixed soon, even if it's just to reduce test > times :) Thanks for the review. Pushed to master and 14, with the wait event moved to the end of the enum for the back-patch. > > From 1eb0266fed7ccb63a2430e4fbbaef2300f2aa0d0 Mon Sep 17 00:00:00 2001 > > From: Thomas Munro > > Date: Tue, 1 Mar 2022 11:38:27 +1300 > > Subject: [PATCH v2 2/3] Fix waiting in RegisterSyncRequest(). > > LGTM. Pushed as far back as 12. It could be done for 10 & 11, but indeed the code starts getting quite different back there, and since there are no field reports, I think that's OK for now. A simple repro, for the record: run installcheck with shared_buffers=256kB, and then partway through, kill -STOP $checkpointer to simulate being stalled on IO for a while. Backends will soon start waiting for the checkpointer to drain the queue while dropping relations. This state was invisible to pg_stat_activity, and hangs forever if you kill the postmaster and CONT the checkpointer. > > From 50060e5a0ed66762680ddee9e30acbad905c6e98 Mon Sep 17 00:00:00 2001 > > From: Thomas Munro > > Date: Tue, 1 Mar 2022 17:34:43 +1300 > > Subject: [PATCH v2 3/3] Use condition variable to wait when sync request > > queue > > is full. > [review] I'll come back to 0003 (condition variable-based improvement) a bit later.
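Spelled out as commands, that repro is roughly the following (process matching and the sleep are approximate; adjust to taste):

pg_ctl -D pgdata -o "-c shared_buffers=256kB" -l logfile start
make installcheck &
sleep 30
kill -STOP $(pgrep -f 'checkpointer')   # pretend the checkpointer is stuck on I/O
# backends dropping relations soon block in RegisterSyncRequest(); before the fix
# that wait was invisible in pg_stat_activity and ignored postmaster death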
Re: USE_BARRIER_SMGRRELEASE on Linux?
On Sat, Feb 19, 2022 at 7:01 AM Nathan Bossart wrote: > Here is a new patch. This is essentially the same as v1, but I've spruced > up the comments and the commit message. Hi Nathan, Pushed and back-patched (it's slightly different before 12). Thanks!
Re: pg_tablespace_location() failure with allow_in_place_tablespaces
On Thu, Mar 17, 2022 at 3:53 PM Michael Paquier wrote: > > On Wed, Mar 16, 2022 at 05:15:58PM +0900, Kyotaro Horiguchi wrote: > > I'm not sure that the "of the symbolic link in pg_tblspc/" is > > needed. And allow_in_place_tablespaces alone doesn't create in-place > > tablespace. So this might need rethink at least for the second point. > > Surely this can be improved. I was not satisfied with this paragraph > after re-reading it this morning, so I have just removed it, rewording > slightly the part for in-place tablespaces that is still necessary. + +A relative path to the data directory is returned for tablespaces +created when allow_in_place_tablespaces is +enabled. + + I think what Horiguchi-san was pointing out above is that you need to enable the GUC *and* say LOCATION '', which the new paragraph doesn't capture. What do you think about this: A path relative to the data directory is returned for in-place tablespaces (see allow_in_place_tablespaces).
Re: pg_tablespace_location() failure with allow_in_place_tablespaces
On Thu, Mar 17, 2022 at 7:18 PM Michael Paquier wrote: > On Thu, Mar 17, 2022 at 04:34:30PM +1300, Thomas Munro wrote: > > I think what Horiguchi-san was pointing out above is that you need to > > enable the GUC *and* say LOCATION '', which the new paragraph doesn't > > capture. What do you think about this: > > > > A path relative to the data directory is returned for in-place > > tablespaces (see allow_in_place_tablespaces). > > An issue I have with this wording is that nowhere in the docs do we give > an explanation of the term "in-place tablespace", even if it can > be guessed from the name of the GUC. Maybe we don't need this paragraph at all. Who is it aimed at? > Another idea would be something like this: > "A relative path to the data directory is returned for tablespaces > created with an empty location string specified in the CREATE > TABLESPACE query when allow_in_place_tablespaces is enabled (see link > blah)." > > But perhaps that's just me being overly pedantic :p I don't really want to spill details of this developer-only stuff onto more manual sections... It's not really helping users if we confuse them with irrelevant details of a feature they shouldn't be using, is it? And the existing treatment "Returns the file system path that this tablespace is located in" is not invalidated by this special case, so maybe we shouldn't mention it?
Re: WIP: WAL prefetch (another approach)
On Mon, Mar 14, 2022 at 8:17 PM Julien Rouhaud wrote: > Great! I'm happy with 0001 and I think it's good to go! I'll push 0001 today to let the build farm chew on it for a few days before moving to 0002.
Re: Declare PG_HAVE_8BYTE_SINGLE_COPY_ATOMICITY for aarch64
On Thu, Mar 17, 2022 at 1:32 AM Yura Sokolov wrote: > So I believe it is safe to define PG_HAVE_8BYTE_SINGLE_COPY_ATOMICITY > for aarch64 Agreed, and pushed. There was another thread that stalled, so I added a reference and a reviewer from that to your commit message. This should probably also be set for RISCV64.
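For anyone following along, the change is essentially a guard of this shape in the atomics headers (sketched from memory, not quoted from the commit; see the tree for the precise location and conditions):

/* sketch only: check src/include/port/atomics/arch-arm.h for the real thing */
#if defined(__aarch64__) || defined(__aarch64)
#define PG_HAVE_8BYTE_SINGLE_COPY_ATOMICITY
#endif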
Re: Probable CF bot degradation
On Sat, Mar 19, 2022 at 5:07 AM Julien Rouhaud wrote: > On Fri, Mar 18, 2022 at 07:43:47PM +0400, Pavel Borisov wrote: > > Hi, hackers! > > I've noticed that CF bot hasn't been running active branches from yesterday: > > https://github.com/postgresql-cfbot/postgresql/branches/active > > > > Also, there is no new results on the current CF page on cputube. > > I don't know if it is a problem or kind of scheduled maintenance though. > > There was a github incident yesterday, that was resolved a few hours ago > ([1]), > maybe the cfbot didn't like that. Yeah, for a while it was seeing: remote: Internal Server Error To github.com:postgresql-cfbot/postgresql.git ! [remote rejected] commitfest/37/3489 -> commitfest/37/3489 (Internal Server Error) error: failed to push some refs to 'github.com:postgresql-cfbot/postgresql.git' Unfortunately cfbot didn't handle that failure very well and it was waiting for a long timeout before scheduling more jobs. It's going again now, and I'll try to make it more resilient against that type of failure...
Re: [PATCH] Add native windows on arm64 support
On Wed, Feb 23, 2022 at 11:09 PM Niyas Sait wrote: > I have created a patch that adds support for building Postgres for the > windows/arm64 platform using the MSVC toolchain. Following changes have been > included > > 1. Extend MSVC scripts to handle ARM64 platform. > 2. Add arm64 definition of spin_delay function. > 3. Exclude arm_acle.h import with MSVC compiler. > > Compilation steps are consistent and similar to other windows platforms. > > The change has been tested on windows/x86_64 and windows/arm64 and all > regression tests passes on both platforms. +# arm64 linker only supports dynamic base address +my $cfgrandbaseaddress = $self->{platform} eq 'ARM64' ? 'true' : 'false'; ... + $cfgrandbaseaddress Does that mean that you can't turn off ASLR on this arch? Does it cause random backend failures due to inability to map shared memory at the expected address? Our experience with EXEC_BACKEND on various Unix systems tells us that you have to turn off ASLR if you want 100% success rate on starting backends, but I guess it depends on the details of how the randomisation is done. Any interest in providing a build farm animal, so that we could know that this works?
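For context, the arm64 spin_delay mentioned in point 2 presumably amounts to something like this MSVC-specific sketch (my guess at the shape, not necessarily what the patch actually does):

#if defined(_M_ARM64)
/* Hint that we're in a spin-wait loop; ISB is the usual choice on ARM64. */
/* Sketch only: the submitted patch may use a different construct. */
static __forceinline void
spin_delay(void)
{
	__isb(_ARM64_BARRIER_SY);
}
#endif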
Re: Probable CF bot degradation
On Sat, Mar 19, 2022 at 9:41 AM Pavel Borisov wrote: >> >> remote: Internal Server Error >> To github.com:postgresql-cfbot/postgresql.git >> ! [remote rejected] commitfest/37/3489 -> commitfest/37/3489 >> (Internal Server Error) >> error: failed to push some refs to >> 'github.com:postgresql-cfbot/postgresql.git' > > I am seeing commitfest/37/3489 in "triggered" state for a long time. No > progress is seen on this branch, though I started to see successful runs on > the other branches now. > Could you see this particular branch and maybe restart it manually? I don't seem to have a way to delete that... it looks like when github told us "Internal Server Error", it had partially succeeded and the new branch (partially?) existed, but something was b0rked and it confused Cirrus. 🤷 There is already another build for 3489 that is almost finished now so I don't think that stale TRIGGERED one is stopping anything from working and I guess it will eventually go away by itself somehow...
Re: A test for replay of regression tests
Another failure under 027_stream_regress.pl: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prion&dt=2022-03-16%2005%3A58%3A05 vacuum ... FAILED 3463 ms I'll try to come up with the perl needed to see the regression.diffs next time...
Re: WIP: WAL prefetch (another approach)
On Fri, Mar 18, 2022 at 9:59 AM Thomas Munro wrote: > I'll push 0001 today to let the build farm chew on it for a few days > before moving to 0002. Clearly 018_wal_optimize.pl is flapping and causing recoveryCheck to fail occasionally, but that predates the above commit. I didn't follow the existing discussion on that, so I'll try to look into that tomorrow. Here's a rebase of the 0002 patch, now called 0001 From 3ac04122e635b98c50d6e48677fe74535d631388 Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Sun, 20 Mar 2022 16:56:12 +1300 Subject: [PATCH v24] Prefetch referenced data in recovery, take II. Introduce a new GUC recovery_prefetch, disabled by default. When enabled, look ahead in the WAL and try to initiate asynchronous reading of referenced data blocks that are not yet cached in our buffer pool. For now, this is done with posix_fadvise(), which has several caveats. Since not all OSes have that system call, "try" is provided so that it can be enabled on operating systems where it is available, and that is used in 027_stream_regress.pl so that we effectively exercise on and off behaviors in the build farm. Better mechanisms will follow in later work on the I/O subsystem. The GUC maintenance_io_concurrency is used to limit the number of concurrent I/Os we allow ourselves to initiate, based on pessimistic heuristics used to infer that I/Os have begun and completed. The GUC wal_decode_buffer_size limits the maximum distance we are prepared to read ahead in the WAL to find uncached blocks. Reviewed-by: Tomas Vondra Reviewed-by: Alvaro Herrera (earlier version) Reviewed-by: Andres Freund (earlier version) Reviewed-by: Justin Pryzby (earlier version) Tested-by: Tomas Vondra (earlier version) Tested-by: Jakub Wartak (earlier version) Tested-by: Dmitry Dolgov <9erthali...@gmail.com> (earlier version) Tested-by: Sait Talha Nisanci (earlier version) Discussion: https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com --- doc/src/sgml/config.sgml | 64 ++ doc/src/sgml/monitoring.sgml | 77 +- doc/src/sgml/wal.sgml | 12 + src/backend/access/transam/Makefile | 1 + src/backend/access/transam/xlog.c | 2 + src/backend/access/transam/xlogprefetcher.c | 968 ++ src/backend/access/transam/xlogreader.c | 13 + src/backend/access/transam/xlogrecovery.c | 160 ++- src/backend/access/transam/xlogutils.c| 27 +- src/backend/catalog/system_views.sql | 13 + src/backend/storage/freespace/freespace.c | 3 +- src/backend/storage/ipc/ipci.c| 3 + src/backend/utils/misc/guc.c | 53 +- src/backend/utils/misc/postgresql.conf.sample | 5 + src/include/access/xlog.h | 1 + src/include/access/xlogprefetcher.h | 51 + src/include/access/xlogreader.h | 8 + src/include/access/xlogutils.h| 3 +- src/include/catalog/pg_proc.dat | 8 + src/include/utils/guc.h | 4 + src/test/recovery/t/027_stream_regress.pl | 3 + src/test/regress/expected/rules.out | 10 + src/tools/pgindent/typedefs.list | 7 + 23 files changed, 1434 insertions(+), 62 deletions(-) create mode 100644 src/backend/access/transam/xlogprefetcher.c create mode 100644 src/include/access/xlogprefetcher.h diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index 7a48973b3c..ce84f379a8 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -3644,6 +3644,70 @@ include_dir 'conf.d' + + +Recovery + + + configuration + of recovery + general settings + + + + This section describes the settings that apply to recovery in general, + affecting crash recovery, streaming replication and archive-based + replication. 
+ + + + + + recovery_prefetch (enum) + + recovery_prefetch configuration parameter + + + + +Whether to try to prefetch blocks that are referenced in the WAL that +are not yet in the buffer pool, during recovery. Valid values are +off (the default), on and +try. The setting try enables +prefetching only if the operating system provides the +posix_fadvise function, which is currently used +to implement prefetching. Note that some operating systems provide the +function, but don't actually perform any prefetching. + + +Prefetching blocks that will soon be needed can reduce I/O wait times +during recovery with some workloads. +See also the and + settings, which limit +prefetching activity. + + + + + + wal_decode_buffer_size (integer) +
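In postgresql.conf terms, the knobs involved look like this (values are only for illustration; see the documentation changes in the patch for the defaults and exact semantics):

# illustrative settings, not recommendations
recovery_prefetch = try              # off | on | try
wal_decode_buffer_size = 512kB       # how far ahead of the replay position to decode
maintenance_io_concurrency = 10      # caps the number of prefetches in flight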
Re: WIP: WAL prefetch (another approach)
On Sun, Mar 20, 2022 at 5:36 PM Thomas Munro wrote: > Clearly 018_wal_optimize.pl is flapping Correction, 019_replslot_limit.pl, discussed at https://www.postgresql.org/message-id/flat/83b46e5f-2a52-86aa-fa6c-8174908174b8%40iki.fi .
Re: A test for replay of regression tests
On Sun, Mar 20, 2022 at 5:20 PM Thomas Munro wrote: > Another failure under 027_stream_regress.pl: > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prion&dt=2022-03-16%2005%3A58%3A05 > > vacuum ... FAILED 3463 ms > > I'll try to come up with the perl needed to see the regression.diffs > next time... Here's my proposed change to achieve that. Here's an example of where it shows up if it fails (from my deliberately sabotaged CI run https://cirrus-ci.com/build/6730380228165632 where I was verifying that it also works on Windows): Unix: https://api.cirrus-ci.com/v1/artifact/task/5421419923243008/log/src/test/recovery/tmp_check/log/regress_log_027_stream_regress Windows: https://api.cirrus-ci.com/v1/artifact/task/4717732481466368/log/src/test/recovery/tmp_check/log/regress_log_027_stream_regress From 5f6257c4b4a44750e85d77b01c20d6cd39c5c795 Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Sun, 20 Mar 2022 21:20:42 +1300 Subject: [PATCH] Log regression.diffs in 027_stream_regress.pl. To help diagnose the reasons for a regression test failure inside this TAP test, dump the contents of regression.diffs to the log. While the CI scripts show it automatically, the build farm client does not. Discussion: https://postgr.es/m/CA%2BhUKGK-q9utvxorA8sCwF3S4SzBmxxDgneP4rLqeHWdZxM4Gg%40mail.gmail.com --- src/test/recovery/t/027_stream_regress.pl | 30 --- 1 file changed, 21 insertions(+), 9 deletions(-) diff --git a/src/test/recovery/t/027_stream_regress.pl b/src/test/recovery/t/027_stream_regress.pl index c40951b7ba..aa972f8958 100644 --- a/src/test/recovery/t/027_stream_regress.pl +++ b/src/test/recovery/t/027_stream_regress.pl @@ -53,15 +53,27 @@ my $outputdir = $PostgreSQL::Test::Utils::tmp_check; # Run the regression tests against the primary. my $extra_opts = $ENV{EXTRA_REGRESS_OPTS} || ""; -system_or_bail($ENV{PG_REGRESS} . " $extra_opts " . - "--dlpath=\"$dlpath\" " . - "--bindir= " . - "--host=" . $node_primary->host . " " . - "--port=" . $node_primary->port . " " . - "--schedule=../regress/parallel_schedule " . - "--max-concurrent-tests=20 " . - "--inputdir=../regress " . - "--outputdir=\"$outputdir\""); +my $rc = system($ENV{PG_REGRESS} . " $extra_opts " . + "--dlpath=\"$dlpath\" " . + "--bindir= " . + "--host=" . $node_primary->host . " " . + "--port=" . $node_primary->port . " " . + "--schedule=../regress/parallel_schedule " . + "--max-concurrent-tests=20 " . + "--inputdir=../regress " . + "--outputdir=\"$outputdir\""); +if ($rc != 0) +{ + # Dump out the regression diffs file, if there is one + my $diffs = "$outputdir/regression.diffs"; + if (-e $diffs) + { + print "=== dumping $diffs ===\n"; + print slurp_file($diffs); + print "=== EOF ===\n"; + } +} +is($rc, 0, 'regression tests pass'); # Clobber all sequences with their next value, so that we don't have # differences between nodes due to caching. -- 2.30.2
Re: A test for replay of regression tests
On Mon, Mar 21, 2022 at 2:34 AM Andrew Dunstan wrote: > On 3/20/22 05:36, Thomas Munro wrote: > > On Sun, Mar 20, 2022 at 5:20 PM Thomas Munro wrote: > >> I'll try to come up with the perl needed to see the regression.diffs > >> next time... > > Here's my proposed change to achieve that. > > I think that's OK. Thanks for looking! Pushed.
Re: Probable CF bot degradation
On Mon, Mar 21, 2022 at 1:58 AM Matthias van de Meent wrote: > Would you know how long the expected bitrot re-check period for CF > entries that haven't been updated is, or could the bitrot-checking > queue be displayed somewhere to indicate the position of a patch in > this queue? I see that your patches were eventually retested. It was set to try to recheck every ~48 hours, though it couldn't quite always achieve that when the total number of eligible submissions was too large. In this case it had stalled for too long after the github outage, which I'm going to try to improve. The reason for the 48+ hour cycle is the Windows tests now take ~25 minutes (since we started actually running all the tests on that platform), and we could only have two Windows tasks running at a time in practice, because the limit for Windows was 8 CPUs, and we use 4 for each task, which means we could only test ~115 branches per day, or actually a shade fewer because it's pretty dumb and only wakes up once a minute to decide what to do, and we currently have 242 submissions (though some don't apply, those are free, so the number varies over time...). There are limits on the Unixes too but they are more generous, and the Unix tests only take 4-10 minutes, so we can ignore that for now, it's all down to Windows. I had been meaning to stump up the USD$10/month it costs to double the CPU limits from the basic free Cirrus account, and I've just now done that and told cfbot it's allowed to test 4 branches at once and to try to test every branch every 24 hours. Let's see how that goes. Here's hoping we can cut down the time it takes to run the tests on Windows... there's some really dumb stuff happening there. Top items I'm aware of: (1) general lack of test concurrency, (2) exec'ing new backends is glacially slow on that OS but we do it for every SQL statement in the TAP tests and every regression test script (I have some patches for this to share after the code freeze). > Additionally, are there plans to validate commits of the main branch > before using them as a base for CF entries, so that "bad" commits on > master won't impact CFbot results as easy? How do you see this working? I have wondered about some kind of way to click a button to say "do this one again now", but I guess that sort of user interaction should ideally happen after merging this thing into the Commitfest app, because it already has auth, and interactive Python/Django web stuff.
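Coming back to the scheduling arithmetic above, here it is spelled out in a trivial sketch, in case anyone wants to play with the numbers; the figures are the ones I stated rather than anything measured:

    #include <stdio.h>

    int
    main(void)
    {
        double      task_minutes = 25.0;    /* one Windows task */
        int         concurrent = 2;         /* 8-CPU limit / 4 CPUs per task */
        double      per_day = 24 * 60 / task_minutes * concurrent;

        printf("branches tested per day: ~%.0f\n", per_day);          /* ~115 */
        printf("days to cycle through 242 submissions: ~%.1f\n",
               242 / per_day);                                         /* ~2.1 */
        return 0;
    }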
Re: Probable CF bot degradation
On Mon, Mar 21, 2022 at 12:23 PM Thomas Munro wrote: > On Mon, Mar 21, 2022 at 1:58 AM Matthias van de Meent > wrote: > > Would you know how long the expected bitrot re-check period for CF > > entries that haven't been updated is, or could the bitrot-checking > > queue be displayed somewhere to indicate the position of a patch in > > this queue? Also, as for the show-me-the-queue page, yeah that's a good idea and quite feasible. I'll look into that in a bit. > > Additionally, are there plans to validate commits of the main branch > > before using them as a base for CF entries, so that "bad" commits on > > master won't impact CFbot results as easy? > > How do you see this working? [Now with more coffee on board] Oh, right, I see, you're probably thinking that we could look at https://github.com/postgres/postgres/commits/master and take the most recent passing commit as a base. Hmm, interesting idea.
Re: Probable CF bot degradation
On Mon, Mar 21, 2022 at 1:41 PM Peter Geoghegan wrote: > BTW, I think that the usability of the CFBot website would be improved > if there was a better visual indicator of what each "green tick inside > a circle" link actually indicates -- what are we testing for each > green tick/red X shown? > > I already see tooltips which show a descriptive string (for example a > tooltip that says "FreeBSD - 13: COMPLETED" which comes from > tags), which is something. But seeing these tooltips > requires several seconds of mouseover on my browser (Chrome). I'd be > quite happy if I could see similar tooltips immediately on mouseover > (which isn't actually possible with standard generic tooltips IIUC), > or something equivalent. Any kind of visual feedback on the nature of > the thing tested by a particular CI run that the user can drill down > to (you know, a Debian logo next to the tick, that kind of thing). Nice idea, if someone with graphics skills is interested in looking into it... Those tooltips come from the "name" elements of the .cirrus.yml file where tasks are defined, with Cirrus's task status appended. If we had a set of monochrome green and red icons with a Linux penguin, FreeBSD daemon, Windows logo and Apple logo of matching dimensions, a config file could map task names to icons, and fall back to ticks/crosses for anything unknown/new, including the "CompilerWarnings" one that doesn't have an obvious icon. Another thing to think about is the 'solid' and 'hollow' variants, the former indicating a recent change. So we'd need 4 variants of each logo. Also I believe there is a proposal to add NetBSD and OpenBSD in the works.
Re: Probable CF bot degradation
On Mon, Mar 21, 2022 at 3:11 PM Andres Freund wrote: > On 2022-03-21 14:44:55 +1300, Thomas Munro wrote: > > Those tooltips come from the "name" elements of the .cirrus.yml file > > where tasks are defined, with Cirrus's task status appended. If we > > had a set of monochrome green and red icons with a Linux penguin, > > FreeBSD daemon, Windows logo and Apple logo of matching dimensions, a > > config file could map task names to icons, and fall back to > > ticks/crosses for anything unknown/new, including the > > "CompilerWarnings" one that doesn't have an obvious icon. Another > > thing to think about is the 'solid' and 'hollow' variants, the former > > indicating a recent change. So we'd need 4 variants of each logo. > > Also I believe there is a proposal to add NetBSD and OpenBSD in the > > works. > > Might even be sufficient to add just the first letter of the task inside the > circle, instead of the "check" and x. Right now the letters are unique. Nice idea, because it retains the information density. If someone with web skills would like to pull down the cfbot page and hack up one of the rows to show an example of a pass, fail, recent-pass, recent-fail as a circle with a letter in it, and also an "in progress" symbol that occupies the same amount of space, I'd be keen to try that. (The current "in progress" blue circle was originally supposed to be a pie filling up slowly according to a prediction of finished time based on past performance, but I never got to that... it's stuck at 1/4 :-))
Re: [PATCH] add relation and block-level filtering to pg_waldump
(defaults to 0, the main fork)\n")); + printf(_(" -l, --relation=N/N/N only show records that touch a specific relation\n")); printf(_(" -n, --limit=N number of records to display\n")); printf(_(" -p, --path=PATHdirectory in which to find log segment files or a\n" " directory with a ./pg_wal that contains such files\n" @@ -777,6 +837,7 @@ usage(void) " (default: 1 or the value used in STARTSEG)\n")); printf(_(" -V, --version output version information, then exit\n")); printf(_(" -x, --xid=XID only show records with transaction ID XID\n")); + printf(_(" -w, --fullpage only show records with a full page write\n")); printf(_(" -z, --stats[=record] show statistics instead of records\n" " (optionally, show per-record statistics)\n")); printf(_(" -?, --help show this help, then exit\n")); @@ -800,12 +861,16 @@ main(int argc, char **argv) static struct option long_options[] = { {"bkp-details", no_argument, NULL, 'b'}, + {"block", required_argument, NULL, 'k'}, {"end", required_argument, NULL, 'e'}, {"follow", no_argument, NULL, 'f'}, + {"fork", required_argument, NULL, 1}, + {"fullpage", no_argument, NULL, 'w'}, {"help", no_argument, NULL, '?'}, {"limit", required_argument, NULL, 'n'}, {"path", required_argument, NULL, 'p'}, {"quiet", no_argument, NULL, 'q'}, + {"relation", required_argument, NULL, 'l'}, {"rmgr", required_argument, NULL, 'r'}, {"start", required_argument, NULL, 's'}, {"timeline", required_argument, NULL, 't'}, @@ -858,6 +923,10 @@ main(int argc, char **argv) config.filter_by_rmgr_enabled = false; config.filter_by_xid = InvalidTransactionId; config.filter_by_xid_enabled = false; + config.filter_by_relation_enabled = false; + config.filter_by_relation_block_enabled = false; + config.filter_by_relation_forknum = MAIN_FORKNUM; + config.filter_by_fpw = false; config.stats = false; config.stats_per_record = false; @@ -870,7 +939,7 @@ main(int argc, char **argv) goto bad_argument; } - while ((option = getopt_long(argc, argv, "be:fn:p:qr:s:t:x:z", + while ((option = getopt_long(argc, argv, "be:fk:l:n:p:qr:s:t:wx:z", long_options, &optindex)) != -1) { switch (option) @@ -890,6 +959,41 @@ main(int argc, char **argv) case 'f': config.follow = true; break; + case 1: /* fork number */ +if (sscanf(optarg, "%u", &config.filter_by_relation_forknum) != 1 || + config.filter_by_relation_forknum >= MAX_FORKNUM) +{ + pg_log_error("could not parse valid fork number (0..%d) \"%s\"", + MAX_FORKNUM - 1, optarg); + goto bad_argument; +} +break; + case 'k': +if (sscanf(optarg, "%u", &config.filter_by_relation_block) != 1 || + !BlockNumberIsValid(config.filter_by_relation_block)) +{ + pg_log_error("could not parse valid block number \"%s\"", optarg); + goto bad_argument; +} +config.filter_by_relation_block_enabled = true; +break; + case 'l': +if (sscanf(optarg, "%u/%u/%u", + &config.filter_by_relation.spcNode, + &config.filter_by_relation.dbNode, + &config.filter_by_relation.relNode) != 3 || + !OidIsValid(config.filter_by_relation.spcNode) || + !OidIsValid(config.filter_by_relation.dbNode) || + !OidIsValid(config.filter_by_relation.relNode) + ) +{ + pg_log_error("could not parse valid relation from \"%s\"/" + " (expecting \"tablespace OID/database OID/" + "relation filenode\")", optarg); + goto bad_argument; +} +config.filter_by_relation_enabled = true; +break; case 'n': if (sscanf(optarg, "%d", &config.stop_after_records) != 1) { @@ -947,6 +1051,9 @@ main(int argc, char **argv) goto bad_argument; } break; + case 'w': +config.filter_by_fpw = true; +break; case 'x': if (sscanf(optarg, "%u", 
&config.filter_by_xid) != 1) { @@ -976,6 +1083,12 @@ main(int argc, char **argv) } } + if (config.filter_by_relation_block_enabled && !config.filter_by_relation_enabled) + { + pg_log_error("--block option requires --relation option to be specified"); + goto bad_argument; + } + if ((optind + 2) < argc) {
Re: [PATCH] add relation and block-level filtering to pg_waldump
On Mon, Mar 21, 2022 at 4:36 PM Thomas Munro wrote: > On Sat, Feb 26, 2022 at 7:58 AM David Christensen > wrote: > > Attached is V2 with additional feedback from this email, as well as the > > specification of the > > ForkNumber and FPW as specifiable options. > > Trivial fixup needed after commit 3f1ce973. [04:30:50.630] pg_waldump.c:963:26: error: format ‘%u’ expects argument of type ‘unsigned int *’, but argument 3 has type ‘ForkNumber *’ [-Werror=format=] [04:30:50.630] 963 | if (sscanf(optarg, "%u", &config.filter_by_relation_forknum) != 1 || [04:30:50.630] | ~^ ~~ [04:30:50.630] | | | [04:30:50.630] | | ForkNumber * [04:30:50.630] | unsigned int * And now that this gets to the CompilerWarnings CI task, it looks like GCC doesn't like an enum as a scanf %u destination (I didn't see that warning locally when I compiled the above fixup because clearly Clang is cool with it...). Probably needs a temporary unsigned int to sscanf into first.
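Concretely, something like this for the fork-number case; just a sketch of the pattern, reusing the config field names from the patch, with the exact range check and error message still up for grabs:

    unsigned int    forknum;    /* temporary with the type %u expects */

    if (sscanf(optarg, "%u", &forknum) != 1 || forknum >= MAX_FORKNUM)
    {
        pg_log_error("could not parse valid fork number (0..%d) \"%s\"",
                     MAX_FORKNUM - 1, optarg);
        goto bad_argument;
    }
    config.filter_by_relation_forknum = (ForkNumber) forknum;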
Re: [PATCH] add relation and block-level filtering to pg_waldump
On Tue, Mar 22, 2022 at 6:14 AM David Christensen wrote: > Updated to include the V3 fixes as well as the unsigned int/enum fix. Hi David, I ran this though pg_indent and adjusted some remaining non-project-style whitespace, and took it for a spin. Very minor comments: pg_waldump: error: could not parse valid relation from ""/ (expecting "tablespace OID/database OID/relation filenode") -> There was a stray "/" in that message, which I've removed in the attached. pg_waldump: error: could not parse valid relation from "1664/0/1262" (expecting "tablespace OID/database OID/relation filenode") -> Why not? Shared relations like pg_database have invalid database OID, so I think it should be legal to write --relation=1664/0/1262. I took out that restriction. + if (sscanf(optarg, "%u", &forknum) != 1 || + forknum >= MAX_FORKNUM) + { + pg_log_error("could not parse valid fork number (0..%d) \"%s\"", + MAX_FORKNUM - 1, optarg); + goto bad_argument; + } I guess you did this because init fork references aren't really expected in the WAL, but I think it's more consistent to allow up to MAX_FORKNUM, not least because your documentation mentions 3 as a valid value. So I adjust this to allow MAX_FORKNUM. Make sense? Here are some more details I noticed, as a likely future user of this very handy feature, which I haven't changed, because they seem more debatable and you might disagree... 1. I think it'd be less surprising if the default value for --fork wasn't 0... why not show all forks? 2. I think it'd be less surprising if --fork without --relation either raised an error (like --block without --relation), or were allowed, with the meaning "show me this fork of all relations". 3. It seems funny to have no short switch for --fork when everything else has one... what about -F? From 596181e9dfe2db9d5338b3ac3c899d230fe0fc78 Mon Sep 17 00:00:00 2001 From: David Christensen Date: Fri, 25 Feb 2022 12:52:56 -0600 Subject: [PATCH v5] Add additional filtering options to pg_waldump. This feature allows you to only output records that are touch a specific RelFileNode and optionally BlockNumber and ForkNum in this relation. We also add the independent ability to filter Full Page Write records. Author: David Christensen Reviewed-by: Peter Geoghegan Reviewed-by: Japin Li Reviewed-by: Bharath Rupireddy Reviewed-by: Cary Huang Reviewed-by: Thomas Munro Discussion: https://postgr.es/m/lzzgmgm6e5.fsf%40veeddrois.attlocal.net --- doc/src/sgml/ref/pg_waldump.sgml | 48 +++ src/bin/pg_waldump/pg_waldump.c | 132 ++- 2 files changed, 179 insertions(+), 1 deletion(-) diff --git a/doc/src/sgml/ref/pg_waldump.sgml b/doc/src/sgml/ref/pg_waldump.sgml index 5735a161ce..f157175764 100644 --- a/doc/src/sgml/ref/pg_waldump.sgml +++ b/doc/src/sgml/ref/pg_waldump.sgml @@ -100,6 +100,44 @@ PostgreSQL documentation + + -k block + --block=block + + +Display only records touching the given block. (Requires also +providing the relation via --relation.) + + + + + + --fork=fork + + +When using the --relation filter, output only records +from the given fork. The valid values here are 0 +for the main fork, 1 for the Free Space +Map, 2 for the Visibility Map, +and 3 for the Init fork. If unspecified, defaults +to the main fork. + + + + + + -l tbl/db/rel + --relation=tbl/db/rel + + +Display only records touching the given relation. The relation is +specified via tablespace OID, database OID, and relfilenode separated +by slashes, for instance, 1234/12345/12345. This +is the same format used for relations in the WAL dump output. 
+ + + + -n limit --limit=limit @@ -183,6 +221,16 @@ PostgreSQL documentation + + -w + --fullpage + + + Filter records to only those which have full page writes. + + + + -x xid --xid=xid diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c index fc081adfb8..eb0c9b2dcb 100644 --- a/src/bin/pg_waldump/pg_waldump.c +++ b/src/bin/pg_waldump/pg_waldump.c @@ -55,6 +55,12 @@ typedef struct XLogDumpConfig bool filter_by_rmgr_enabled; TransactionId filter_by_xid; bool filter_by_xid_enabled; + RelFileNode filter_by_relation; + bool filter_by_relation_enabled; + BlockNumber filter_by_relation_block; + bool
Re: Why is src/test/modules/committs/t/002_standby.pl flaky?
I have a new socket abstraction patch that should address the known Windows socket/event bugs, but it's a little bigger than I thought it would be, not quite ready, and now too late to expect people to review for 15, so I think it should go into the next cycle. I've bounced https://commitfest.postgresql.org/37/3523/ into the next CF. We'll need to do something like 75674c7e for master.
Broken make dependency somewhere near win32ver.o?
Hi, Here's a strange one-off failure seen on CI[1], in the CompilerWarnings task where we check that mingw cross-compile works: [10:48:29.045] time make -s -j${BUILD_JOBS} world-bin [10:48:38.705] x86_64-w64-mingw32-gcc: error: win32ver.o: No such file or directory [10:48:38.705] make[3]: *** [Makefile:44: pg_dumpall] Error 1 [10:48:38.705] make[3]: *** Waiting for unfinished jobs [10:48:38.709] make[2]: *** [Makefile:43: all-pg_dump-recurse] Error 2 [10:48:38.709] make[2]: *** Waiting for unfinished jobs [10:48:38.918] make[1]: *** [Makefile:42: all-bin-recurse] Error 2 [10:48:38.918] make: *** [GNUmakefile:21: world-bin-src-recurse] Error 2 I guess this implies a dependency problem somewhere around src/makefiles/Makefile.win32 but I'm not familiar with how that .rc stuff is supposed to work and I figured I'd mention it here in case it's obvious to someone else... [1] https://cirrus-ci.com/task/5546921619095552
Re: Why is src/test/modules/committs/t/002_standby.pl flaky?
On Tue, Mar 22, 2022 at 4:13 PM Tom Lane wrote: > Thomas Munro writes: > > I have a new socket abstraction patch that should address the known > > Windows socket/event bugs, but it's a little bigger than I thought it > > would be, not quite ready, and now too late to expect people to review > > for 15, so I think it should go into the next cycle. I've bounced > > https://commitfest.postgresql.org/37/3523/ into the next CF. We'll > > need to do something like 75674c7e for master. > > OK. You want me to push 75674c7e to HEAD? Thanks, yes, please do.
Re: A test for replay of regression tests
On Tue, Mar 22, 2022 at 4:31 PM Masahiko Sawada wrote: > SELECT pg_relation_size('vac_truncate_test') = 0; > ?column? > -- > - t > + f Thanks. Ahh, déjà vu... this probably needs the same treatment as b700f96c and 3414099c provided for the reloptions test. Well, at least the first one. Here's a patch like that. From ba05f03c202bf66c7692787ec24ece13b193a897 Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Tue, 22 Mar 2022 16:54:42 +1300 Subject: [PATCH] Try to stabilize vacuum test. As commits b700f96c and 3414099c did for the reloptions test, take measures to make sure that vacuum always truncates the table. Discussion: https://postgr.es/m/CAD21AoCNoWjYkdEtr%2BVDoF9v__V905AedKZ9iF%3DArgCtrbxZqw%40mail.gmail.com --- src/test/regress/expected/vacuum.out | 6 +++--- src/test/regress/sql/vacuum.sql | 6 +++--- 2 files changed, 6 insertions(+), 6 deletions(-) diff --git a/src/test/regress/expected/vacuum.out b/src/test/regress/expected/vacuum.out index 3e70e4c788..c63a157e5f 100644 --- a/src/test/regress/expected/vacuum.out +++ b/src/test/regress/expected/vacuum.out @@ -165,19 +165,19 @@ VACUUM (INDEX_CLEANUP FALSE) vaccluster; VACUUM (INDEX_CLEANUP AUTO) vactst; -- index cleanup option is ignored if no indexes VACUUM (INDEX_CLEANUP FALSE, FREEZE TRUE) vaccluster; -- TRUNCATE option -CREATE TABLE vac_truncate_test(i INT NOT NULL, j text) +CREATE TEMP TABLE vac_truncate_test(i INT NOT NULL, j text) WITH (vacuum_truncate=true, autovacuum_enabled=false); INSERT INTO vac_truncate_test VALUES (1, NULL), (NULL, NULL); ERROR: null value in column "i" of relation "vac_truncate_test" violates not-null constraint DETAIL: Failing row contains (null, null). -VACUUM (TRUNCATE FALSE) vac_truncate_test; +VACUUM (TRUNCATE FALSE, DISABLE_PAGE_SKIPPING) vac_truncate_test; SELECT pg_relation_size('vac_truncate_test') > 0; ?column? -- t (1 row) -VACUUM vac_truncate_test; +VACUUM (DISABLE_PAGE_SKIPPING) vac_truncate_test; SELECT pg_relation_size('vac_truncate_test') = 0; ?column? -- diff --git a/src/test/regress/sql/vacuum.sql b/src/test/regress/sql/vacuum.sql index 18cb7fd08a..9faa8a34a6 100644 --- a/src/test/regress/sql/vacuum.sql +++ b/src/test/regress/sql/vacuum.sql @@ -147,12 +147,12 @@ VACUUM (INDEX_CLEANUP AUTO) vactst; -- index cleanup option is ignored if no ind VACUUM (INDEX_CLEANUP FALSE, FREEZE TRUE) vaccluster; -- TRUNCATE option -CREATE TABLE vac_truncate_test(i INT NOT NULL, j text) +CREATE TEMP TABLE vac_truncate_test(i INT NOT NULL, j text) WITH (vacuum_truncate=true, autovacuum_enabled=false); INSERT INTO vac_truncate_test VALUES (1, NULL), (NULL, NULL); -VACUUM (TRUNCATE FALSE) vac_truncate_test; +VACUUM (TRUNCATE FALSE, DISABLE_PAGE_SKIPPING) vac_truncate_test; SELECT pg_relation_size('vac_truncate_test') > 0; -VACUUM vac_truncate_test; +VACUUM (DISABLE_PAGE_SKIPPING) vac_truncate_test; SELECT pg_relation_size('vac_truncate_test') = 0; VACUUM (TRUNCATE FALSE, FULL TRUE) vac_truncate_test; DROP TABLE vac_truncate_test; -- 2.30.2
Re: New Object Access Type hooks
On Fri, Mar 18, 2022 at 4:22 PM Mark Dilger wrote: > (FYI, I got a test failure from src/test/recovery/t/013_crash_restart.pl when > testing v1-0001. I'm not sure yet what that is about.) Doesn't look like 0001 has anything to do with that... Are you on a Mac? Did it look like this recent failure from CI? https://cirrus-ci.com/task/4686108033286144 https://api.cirrus-ci.com/v1/artifact/task/4686108033286144/log/src/test/recovery/tmp_check/log/regress_log_013_crash_restart https://api.cirrus-ci.com/v1/artifact/task/4686108033286144/log/src/test/recovery/tmp_check/log/013_crash_restart_primary.log I have no idea what is going on there, but searching for discussion brought me here...
Re: Broken make dependency somewhere near win32ver.o?
On Tue, Mar 22, 2022 at 4:14 PM Andres Freund wrote: > The problem looks to be that pg_dumpall doesn't have a dependency on OBJs, > which in turn is what contains the dependency on WIN32RES, which in turn > contains win32ver.o. So the build succeeds if pg_dump/restores's dependencies > are built first, but not if pg_dumpall starts to be built before that... > > Seems we just need to add $(WIN32RES) to pg_dumpall: ? Ah, yeah, that looks right. I don't currently have a mingw setup to test, but clearly $(WIN32RES) is passed to $(CC) despite not being listed as a dependency.
Re: MDAM techniques and Index Skip Scan patch
On Tue, Mar 22, 2022 at 2:34 PM Andres Freund wrote: > IMO it's pretty clear that having "duelling" patches below one CF entry is a > bad idea. I think they should be split, with inactive approaches marked as > returned with feeback or whatnot. I have the impression that this thread is getting some value from having a CF entry, as a multi-person collaboration where people are trading ideas and also making progress that no one wants to mark as returned, but it's vexing for people managing the CF because it's not really proposed for 15. Perhaps what we lack is a new status, "Work In Progress" or something?
Re: Probable CF bot degradation
On Mon, Mar 21, 2022 at 12:46 PM Thomas Munro wrote: > On Mon, Mar 21, 2022 at 12:23 PM Thomas Munro wrote: > > On Mon, Mar 21, 2022 at 1:58 AM Matthias van de Meent > > wrote: > > > Additionally, are there plans to validate commits of the main branch > > > before using them as a base for CF entries, so that "bad" commits on > > > master won't impact CFbot results as easy? > > > > How do you see this working? > > [Now with more coffee on board] Oh, right, I see, you're probably > thinking that we could look at > https://github.com/postgres/postgres/commits/master and take the most > recent passing commit as a base. Hmm, interesting idea. A nice case in point today: everything is breaking on Windows due to a commit in master, which could easily be avoided by looking back a certain distance for a passing commit from postgres/postgres to use as a base. Let me see if this is easy to fix... https://www.postgresql.org/message-id/20220322231311.GK28503%40telsasoft.com
Re: cpluspluscheck vs ICU
On Wed, Mar 23, 2022 at 3:23 PM Andres Freund wrote: > On 2022-03-22 17:20:24 -0700, Andres Freund wrote: > > I was about to propose adding headerscheck / cpluspluscheck to the CI file > > so > > that cfbot can catch future issues. > > The attached patch does so, with ICU disabled to avoid the problems discussed > in the thread. Example run: > https://cirrus-ci.com/task/6326161696358400?logs=headers_headerscheck#L0 > > Unless somebody sees a reason not to, I'm planning to commit this soon. LGTM.
Re: [PATCH] Add native windows on arm64 support
On Tue, Mar 22, 2022 at 11:30 PM Julien Rouhaud wrote: > On Tue, Mar 22, 2022 at 09:37:46AM +, Niyas Sait wrote: > > Yes, we could look into providing a build machine. Do you have any > > reference to what the CI system looks like now for PostgresSQL and how to > > add new workers etc.? > > It's all documented at > https://wiki.postgresql.org/wiki/PostgreSQL_Buildfarm_Howto. It seems likely there will be more and more Windows/ARM users, so yeah having a machine to test that combination would be great. I wonder if ASLR isn't breaking for you by good luck only...
Re: Adding CI to our tree
On Thu, Mar 10, 2022 at 9:37 AM Justin Pryzby wrote: > -Og Adding this to CXXFLAGS caused a torrent of warnings from g++ about LLVM headers, which I also see on my local system for LLVM 11 and LLVM 14: [19:47:11.047] /usr/lib/llvm-11/include/llvm/ADT/Twine.h: In member function ‘llvm::CallInst* llvm::IRBuilderBase::CreateCall(llvm::FunctionType*, llvm::Value*, llvm::ArrayRef, const llvm::Twine&, llvm::MDNode*)’: [19:47:11.047] /usr/lib/llvm-11/include/llvm/ADT/Twine.h:229:16: warning: ‘.llvm::Twine::LHS.llvm::Twine::Child::twine’ may be used uninitialized in this function [-Wmaybe-uninitialized] [19:47:11.047] 229 | !LHS.twine->isBinary()) [19:47:11.047] | ^ https://cirrus-ci.com/task/5597526098182144?logs=build#L6 Not sure who to complain to about that...
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
On Thu, Mar 24, 2022 at 9:59 AM Peter Geoghegan wrote: > On Wed, Mar 23, 2022 at 1:53 PM Robert Haas wrote: > > And therefore I favor defining it to mean that we don't > > skip any work at all. > > But even today DISABLE_PAGE_SKIPPING won't do pruning when we cannot > acquire a cleanup lock on a page, unless it happens to have XIDs from > before FreezeLimit (which is probably 50 million XIDs behind > OldestXmin, the vacuum_freeze_min_age default). I don't see much > difference. Yeah, I found it confusing that DISABLE_PAGE_SKIPPING doesn't disable all page skipping, so 3414099c turned out to be not enough.
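To spell out the behaviour I mean, in rough pseudo-C; this is a paraphrase for illustration, not the actual vacuumlazy.c code, and page_has_xids_before() is a made-up stand-in for the FreezeLimit test:

    /* Even with DISABLE_PAGE_SKIPPING, pages we can't cleanup-lock are
     * only processed if they contain sufficiently old XIDs. */
    if (!ConditionalLockBufferForCleanup(buf))
    {
        if (!page_has_xids_before(buf, FreezeLimit))    /* hypothetical */
            continue;           /* no old XIDs: skip pruning this page */
        LockBufferForCleanup(buf);      /* otherwise wait; must freeze it */
    }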
Re: [PATCH] add relation and block-level filtering to pg_waldump
On Thu, Mar 24, 2022 at 9:53 AM Peter Eisentraut wrote: > On 21.03.22 05:55, Thomas Munro wrote: > > [04:30:50.630] pg_waldump.c:963:26: error: format ‘%u’ expects > > argument of type ‘unsigned int *’, but argument 3 has type ‘ForkNumber > > *’ [-Werror=format=] > > [04:30:50.630] 963 | if (sscanf(optarg, "%u", > > &config.filter_by_relation_forknum) != 1 || > > [04:30:50.630] | ~^ ~~ > > [04:30:50.630] | | | > > [04:30:50.630] | | ForkNumber * > > [04:30:50.630] | unsigned int * > > > > And now that this gets to the CompilerWarnings CI task, it looks like > > GCC doesn't like an enum as a scanf %u destination (I didn't see that > > warning locally when I compiled the above fixup because clearly Clang > > is cool with it...). Probably needs a temporary unsigned int to > > sscanf into first. > > That's because ForkNum is a signed type. You will probably succeed if > you use "%d" instead. Erm, is that really OK? C says "Each enumerated type shall be compatible with char, a signed integer type, or an unsigned integer type. The choice of type is implementation-defined, but shall be capable of representing the values of all the members of the enumeration." It could even legally vary from enum to enum, though in practice most compilers probably just use ints all the time unless you use weird pragma pack incantation. Therefore I think you need an intermediate variable with the size and signedness matching the format string, if you're going to scanf directly into it, which David's V6 did.
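To make the hazard concrete: with a common compiler extension the enum really can be narrower than unsigned int, at which point scanning straight into it scribbles past the object. A contrived, self-contained example — nothing like this exists in our tree, it just demonstrates the point:

    #include <stdio.h>

    /* GCC/Clang extension: use the smallest representation that fits. */
    typedef enum {FORK_MAIN, FORK_FSM, FORK_VM, FORK_INIT}
            __attribute__((packed)) tiny_fork;

    int
    main(void)
    {
        tiny_fork       f;
        unsigned int    tmp;

        printf("sizeof(tiny_fork) = %zu\n", sizeof(tiny_fork)); /* 1, not 4 */

        /* sscanf("2", "%u", &f) would write 4 bytes into a 1-byte object;
         * the portable pattern is to go through a correctly typed temporary: */
        if (sscanf("2", "%u", &tmp) == 1)
            f = (tiny_fork) tmp;
        return f == FORK_VM ? 0 : 1;
    }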
Re: Avoid erroring out when unable to remove or parse logical rewrite files to save checkpoint work
On Wed, Feb 16, 2022 at 12:11 PM Nathan Bossart wrote: > On Tue, Feb 15, 2022 at 10:37:58AM -0800, Nathan Bossart wrote: > > On Tue, Feb 15, 2022 at 10:10:34AM -0800, Andres Freund wrote: > >> I generally think it'd be a good exercise to go through an use > >> get_dirent_type() in nearly all ReadDir() style loops - it's a nice > >> efficiency > >> win in general, and IIRC a particularly big one on windows. > >> > >> It'd probably be good to add a reference to get_dirent_type() to > >> ReadDir[Extended]()'s docs. > > > > Agreed. I might give this a try. > > Alright, here is a new patch set where I've tried to replace as many > stat()/lstat() calls as possible with get_dirent_type(). 0002 and 0003 are > the same as v9. I noticed a few remaining stat()/lstat() calls that don't > appear to be doing proper error checking, but I haven't had a chance to try > fixing those yet. 0001: These get_dirent_type() conversions are nice. LGTM. 0002: /* we're only handling directories here, skip if it's not ours */ -if (lstat(path, &statbuf) == 0 && !S_ISDIR(statbuf.st_mode)) +if (lstat(path, &statbuf) != 0) +ereport(ERROR, +(errcode_for_file_access(), + errmsg("could not stat file \"%s\": %m", path))); +else if (!S_ISDIR(statbuf.st_mode)) return; Why is this a good place to silently ignore non-directories? StartupReorderBuffer() is already in charge of skipping random detritus found in the directory, so would it be better to do "if (get_dirent_type(...) != PGFILETYPE_DIR) continue" there, and then drop the lstat() stanza from ReorderBufferCleanupSerializedTXNs() completely? Then perhaps its ReadDirExtended() should be using ERROR instead of INFO, so that missing/non-dir/b0rked directories raise an error. I don't understand why it's reporting readdir() errors at INFO but unlink() errors at ERROR, and as far as I can see the other paths that reach this code shouldn't be sending in paths to non-directories here unless something is seriously busted and that's ERROR-worthy. 0003: I haven't studied this cure vs disease thing... no opinion today.
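For the archives, the general shape of those 0001 conversions is just this; a minimal sketch of the pattern using the same get_dirent_type() call as the patch, not any particular hunk:

    DIR        *dir;
    struct dirent *de;

    dir = AllocateDir(path);
    while ((de = ReadDir(dir, path)) != NULL)
    {
        char        fullpath[MAXPGPATH];

        if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
            continue;
        snprintf(fullpath, sizeof(fullpath), "%s/%s", path, de->d_name);

        /* One d_type-aware call instead of an lstat() per entry. */
        if (get_dirent_type(fullpath, de, false, ERROR) != PGFILETYPE_DIR)
            continue;           /* not a directory, so not ours */

        /* ... process the subdirectory ... */
    }
    FreeDir(dir);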
Checking pgwin32_is_junction() errors
Hi, The comment for pgwin32_is_junction() says "Assumes the file exists, so will return false if it doesn't (since a nonexistent file is not a junction)". In fact that's the behaviour for any kind of error, and although we set errno in that case, no caller ever checks it. I think it'd be better to add missing_ok and elevel parameters, following existing patterns. Unfortunately, it can't use the generic frontend logging to implement elevel in frontend code from its current location, because pgport can't call pgcommon. For now I came up with a kludge to work around that problem, but I don't like it, and would need to come up with something better... Sketch code attached. From 8e0e51359c0191cf673c3ddc87a3262b13f530a4 Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Wed, 16 Mar 2022 23:05:39 +1300 Subject: [PATCH] Raise errors in pgwin32_is_junction(). XXX It's not very nice to use elevel like this in frontend code --- src/backend/access/transam/xlog.c| 2 +- src/backend/replication/basebackup.c | 4 ++-- src/backend/storage/file/fd.c| 2 +- src/backend/utils/adt/misc.c | 2 +- src/bin/pg_checksums/pg_checksums.c | 2 +- src/bin/pg_rewind/file_ops.c | 2 +- src/common/file_utils.c | 5 +++- src/include/port.h | 2 +- src/include/port/win32_port.h| 2 +- src/port/dirmod.c| 36 +--- 10 files changed, 46 insertions(+), 13 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 4ac3871c74..819b05177d 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -8314,7 +8314,7 @@ do_pg_start_backup(const char *backupidstr, bool fast, TimeLineID *starttli_p, * directories directly under pg_tblspc, which would fail below. */ #ifdef WIN32 - if (!pgwin32_is_junction(fullpath)) + if (!pgwin32_is_junction(fullpath, false, ERROR)) continue; #else if (get_dirent_type(fullpath, de, false, ERROR) != PGFILETYPE_LNK) diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c index 6884cad2c0..4be9090445 100644 --- a/src/backend/replication/basebackup.c +++ b/src/backend/replication/basebackup.c @@ -1352,7 +1352,7 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly, #ifndef WIN32 S_ISLNK(statbuf.st_mode) #else - pgwin32_is_junction(pathbuf) + pgwin32_is_junction(pathbuf, true, ERROR) #endif ) { @@ -1840,7 +1840,7 @@ convert_link_to_directory(const char *pathbuf, struct stat *statbuf) #ifndef WIN32 if (S_ISLNK(statbuf->st_mode)) #else - if (pgwin32_is_junction(pathbuf)) + if (pgwin32_is_junction(pathbuf, true, ERROR)) #endif statbuf->st_mode = S_IFDIR | pg_dir_create_mode; } diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c index 14b77f2861..ec2f49fefa 100644 --- a/src/backend/storage/file/fd.c +++ b/src/backend/storage/file/fd.c @@ -3453,7 +3453,7 @@ SyncDataDirectory(void) xlog_is_symlink = true; } #else - if (pgwin32_is_junction("pg_wal")) + if (pgwin32_is_junction("pg_wal", false, LOG)) xlog_is_symlink = true; #endif diff --git a/src/backend/utils/adt/misc.c b/src/backend/utils/adt/misc.c index 89690be2ed..723504ab10 100644 --- a/src/backend/utils/adt/misc.c +++ b/src/backend/utils/adt/misc.c @@ -317,7 +317,7 @@ pg_tablespace_location(PG_FUNCTION_ARGS) * found, a relative path to the data directory is returned. 
*/ #ifdef WIN32 - if (!pgwin32_is_junction(sourcepath)) + if (!pgwin32_is_junction(sourcepath, false, ERROR)) PG_RETURN_TEXT_P(cstring_to_text(sourcepath)); #else if (lstat(sourcepath, &st) < 0) diff --git a/src/bin/pg_checksums/pg_checksums.c b/src/bin/pg_checksums/pg_checksums.c index 5f0f5ee62d..5720907d33 100644 --- a/src/bin/pg_checksums/pg_checksums.c +++ b/src/bin/pg_checksums/pg_checksums.c @@ -404,7 +404,7 @@ scan_directory(const char *basedir, const char *subdir, bool sizeonly) #ifndef WIN32 else if (S_ISDIR(st.st_mode) || S_ISLNK(st.st_mode)) #else - else if (S_ISDIR(st.st_mode) || pgwin32_is_junction(fn)) + else if (S_ISDIR(st.st_mode) || pgwin32_is_junction(fn, false, 1)) #endif { /* diff --git a/src/bin/pg_rewind/file_ops.c b/src/bin/pg_rewind/file_ops.c index 6cb288f099..1eda2baf2f 100644 --- a/src/bin/pg_rewind/file_ops.c +++ b/src/bin/pg_rewind/file_ops.c @@ -434,7 +434,7 @@ recurse_dir(const char *datadir, const char *parentpath, #ifndef WIN32 else if (S_ISLNK(fst.st_mode)) #else - else if (pgwin32_is_junction(fullpath)) + else if (pgwin32_is_junction(fullpath, false, 1)) #endif { #if defined(HAVE_READLINK) || defined(WIN32) diff --git a/src/common/file_utils.c b/src/common/file_utils.c index 7138068633..2db909a3aa 100644 --- a/src/common/file_utils.c +++ b/src/common/file_utils.c @@ -88,8 +88,11 @@ fsync_pgdata(const char *pg_data,
Re: BUG #16663: DROP INDEX did not free up disk space: idle connection hold file marked as deleted
On Mon, Nov 30, 2020 at 6:59 PM Thomas Munro wrote: > On Wed, Nov 25, 2020 at 8:00 AM Pavel Borisov wrote: > > The new status of this patch is: Ready for Committer > ... That's the version I plan to commit tomorrow, unless > there are further comments or objections. ... Done, and back-patched. I thought a bit more about the fact that we fail to unlink higher-numbered segments in certain error cases, potentially leaving stray files behind. As far as I can see, nothing we do in this code-path is going to be a bullet-proof solution to that problem. One simple idea would be for the checkpointer to refuse to unlink segment 0 (thereby allowing the relfilenode to be recycled) until it has scanned the parent directory for any related files that shouldn't be there. > While looking at trace output, I figured we should just use > truncate(2) on non-Windows, on the master branch only. It's not like > it really makes much difference, but I don't see why we shouldn't > allow ourselves to use ancient standardised Unix syscalls when we can. Also pushed, but only to master.
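Sketching that idea a little further: before letting segment 0 go and the relfilenode be recycled, the checkpointer would grovel through the database directory for leftover "<relfilenode>.<n>" files and remove them first. Very roughly, with dbpath/relnode as placeholder inputs and no claim that this is how it'd really be spelled:

    DIR        *dir = AllocateDir(dbpath);
    struct dirent *de;
    char        prefix[32];

    snprintf(prefix, sizeof(prefix), "%u.", relnode);
    while ((de = ReadDir(dir, dbpath)) != NULL)
    {
        if (strncmp(de->d_name, prefix, strlen(prefix)) == 0)
        {
            char        path[MAXPGPATH];

            snprintf(path, sizeof(path), "%s/%s", dbpath, de->d_name);
            if (unlink(path) < 0 && errno != ENOENT)
                elog(LOG, "could not remove stray file \"%s\": %m", path);
        }
    }
    FreeDir(dir);
    /* only now unlink segment 0, allowing the relfilenode to be recycled */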
Re: BUG #16663: DROP INDEX did not free up disk space: idle connection hold file marked as deleted
On Tue, Dec 1, 2020 at 3:55 PM Michael Paquier wrote: > On Mon, Nov 30, 2020 at 06:59:40PM +1300, Thomas Munro wrote: > > So I fixed that, by adding a return value to do_truncate() and > > checking it. That's the version I plan to commit tomorrow, unless > > there are further comments or objections. I've also attached a > > version suitable for REL_11_STABLE and earlier branches (with a name > > that cfbot should ignore), where things are slightly different. In > > those branches, the register_forget_request() logic is elsewhere. > > Hmm. Sorry for arriving late at the party. But is that really > something suitable for a backpatch? Sure, it is not optimal to not > truncate all the segments when a transaction dropping a relation > commits, but this was not completely broken either. I felt on balance it was a "bug", since it causes operational difficulties for people and was clearly not our intended behaviour, and I announced this intention 6 weeks ago. Of course I'll be happy to revert it from the back-branches if that's the consensus. Any other opinions?
PG vs LLVM 12 on seawasp, next round
Hi Since seawasp's bleeding edge LLVM installation moved to "trunk 20201114 c8f4e06b 12.0.0" ~16 days ago, it has been red. Further updates didn't help it and it's now on "trunk 20201127 6ee22ca6 12.0.0". I wonder if there is something in Fabien's scripting that needs to be tweaked, perhaps a symlink name or similar. I don't follow LLVM development but I found my way to a commit[1] around the right time that mentions breaking up the OrcJIT library, so *shrug* maybe that's a clue. +ERROR: could not load library "/home/fabien/pg/build-farm-11/buildroot/HEAD/pgsql.build/tmp_install/home/fabien/pg/build-farm-11/buildroot/HEAD/inst/lib/postgresql/llvmjit.so": libLLVMOrcJIT.so.12git: cannot open shared object file: No such file or directory [1] https://github.com/llvm/llvm-project/commit/1d0676b54c4e3a517719220def96dfdbc26d8048
Re: Recent eelpout failures on 9.x branches
On Wed, Dec 2, 2020 at 11:36 AM Tom Lane wrote: > Perhaps this is just a question of the machine being too slow to complete > the test, in which case we ought to raise wal_sender_timeout. But it's > weird that it would've started to fail just now, because I don't really > see any changes in those branches that would explain a week-old change > in the test runtime. Unfortunately, eelpout got kicked off the nice shiny ARM server it was running on so last week I moved it to a rack mounted Raspberry Pi. It seems to be totally I/O starved causing some timeouts to be reached, and I'm looking into fixing that by adding fast storage. This may take a few days. Sorry for the noise.
Re: Recent eelpout failures on 9.x branches
On Wed, Dec 2, 2020 at 12:07 PM Tom Lane wrote: > I'm also wondering a bit why the issue isn't affecting the newer > branches. It's certainly not because we made the test shorter ... I looked at htop while it was building the 9.x branches and saw pg_basebackup sitting in D state waiting for glacial I/O. Perhaps this was improved by the --no-sync option that got sprinkled around the place in later releases?
Re: Commitfest 2020-11 is closed
On Thu, Dec 3, 2020 at 10:00 AM Tom Lane wrote: > This is actually a bit problematic, because now the cfbot is ignoring > those patches (or if it's not, I don't know where it's displaying the > results). Please go ahead and move the remaining open patches, or > else re-open the CF if that's possible. As of quite recently, Travis CI doesn't seem to like cfbot's rate of build jobs. Recently it's been taking a very long time to post results for new patches and taking so long to get around to retesting patches for bitrot that my "delete old results after a week" logic started wiping out some results while they are still the most recent, leading to the blank spaces where the results are supposed to be. D'oh. I'm looking into a couple of options.
Re: Commitfest 2020-11 is closed
On Thu, Dec 3, 2020 at 10:51 AM David Steele wrote: > On 12/2/20 4:15 PM, Thomas Munro wrote: > > On Thu, Dec 3, 2020 at 10:00 AM Tom Lane wrote: > >> This is actually a bit problematic, because now the cfbot is ignoring > >> those patches (or if it's not, I don't know where it's displaying the > >> results). Please go ahead and move the remaining open patches, or > >> else re-open the CF if that's possible. > > > > As of quite recently, Travis CI doesn't seem to like cfbot's rate of > > build jobs. Recently it's been taking a very long time to post > > results for new patches and taking so long to get around to retesting > > patches for bitrot that the my "delete old results after a week" logic > > started wiping out some results while they are still the most recent, > > leading to the blank spaces where the results are supposed to be. > > D'oh. I'm looking into a couple of options. > > pgBackRest test runs have gone from ~17 minutes to 3-6 hours over the > last two months. Ouch. > Also keep in mind that travis-ci.org will be shut down at the end of the > month. Some users who have migrated to travis-ci.com have complained > that it is not any faster, but I have not tried myself (yet). Oh. > Depending on how you have Github organized migrating to travis-ci.com > may be bit tricky because it requires full access to all private > repositories in your account and orgs administrated by your account. I'm experimenting with Github's built in CI. All other ideas welcome.
Github Actions (CI)
Hi hackers, I'm looking for more horsepower for testing commitfest entries automatically, and today I tried out $SUBJECT. The attached is a rudimentary first attempt, for show-and-tell. If you have a Github account, you just have to push it to a branch there and look at the Actions tab on the web page for the results. Does anyone else have .github files and want to share, to see if we can combine efforts here? The reason for creating three separate "workflows" for Linux, Windows and macOS rather than three separate "jobs" inside one workflow is so that cfbot.cputube.org could potentially get separate pass/fail results for each OS out of the API rather than one combined result. I rather like that feature of cfbot's results. (I could be wrong about needing to do that, this is the first time I've ever looked at this stuff.) The Windows test actually fails right now, exactly as reported by Ranier[1]. It is a release build on a recent MSVC, so I guess that is expected and off-topic for this thread. But generally, .github/workflows/ci-windows.yml is the weakest part of this. It'd be great to get a debug/assertion build, show backtraces when it crashes, run more of the tests, etc etc, but I don't know nearly enough about Windows to do that myself. Another thing is that it uses Choco for flex and bison; it'd be better to find those on the image, if possible. Also, for all 3 OSes, it's not currently attempting to cache build results or anything like that. I'm a bit sad that GH doesn't have FreeBSD build runners. Those are now popping up on other CIs, but I'm not sure if their free/open source tiers have enough resources for cfbot. [1] https://www.postgresql.org/message-id/flat/CAEudQArhn8bH836OB%2B3SboiaeEcgOtrJS58Bki4%3D5yeVqToxgw%40mail.gmail.com From 424bcae7f1fcb1ada0e7046bfc5e0c5254c6f439 Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Thu, 3 Dec 2020 10:58:16 +1300 Subject: [PATCH] Github CI WIP --- .github/workflows/ci-linux.yml | 43 ++ .github/workflows/ci-macos.yml | 38 +++ .github/workflows/ci-windows-buildsetup.pl | 35 ++ .github/workflows/ci-windows-dumpregr.pl | 22 +++ .github/workflows/ci-windows.yml | 32 5 files changed, 170 insertions(+) create mode 100644 .github/workflows/ci-linux.yml create mode 100644 .github/workflows/ci-macos.yml create mode 100644 .github/workflows/ci-windows-buildsetup.pl create mode 100644 .github/workflows/ci-windows-dumpregr.pl create mode 100644 .github/workflows/ci-windows.yml diff --git a/.github/workflows/ci-linux.yml b/.github/workflows/ci-linux.yml new file mode 100644 index 00..8ee32770cd --- /dev/null +++ b/.github/workflows/ci-linux.yml @@ -0,0 +1,43 @@ +name: ci-linux +on: [push] +jobs: + ci-linux: +runs-on: ubuntu-latest +steps: +- uses: actions/checkout@v2 +- name: Install packages + run: | +sudo apt-get --yes update +sudo apt-get --yes install gcc libreadline-dev flex bison make perl libipc-run-perl clang llvm-dev libperl-dev libpython-dev tcl-dev libldap2-dev libicu-dev docbook-xml docbook-xsl fop libxml2-utils xsltproc krb5-admin-server krb5-kdc krb5-user slapd ldap-utils libssl-dev pkg-config locales-all gdb + env: +DEBIAN_FRONTEND: noninteractive +- name: Configure + run: ./configure --enable-cassert --enable-debug --enable-tap-tests --with-tcl --with-python --with-perl --with-ldap --with-openssl --with-icu --with-llvm +- name: Build + run: | +echo "COPT=-Wall -Werror" > src/Makefile.custom +make -s -j3 +- name: Check world + run: | +echo '/tmp/%e-%s-%p.core' | sudo tee /proc/sys/kernel/core_pattern +ulimit -c unlimited +make -s 
check-world + env: +PG_TEST_EXTRA: "ssl kerberos" +#PG_TEST_EXTRA: "ssl ldap kerberos" why does slapd fail to start? +- name: Look for clues + if: ${{ failure() }} + run: | +# dump useful logs +for F in ` find . -name initdb.log -o -name regression.diffs ` ; do + echo === $F === + head -1000 $F +done +# look for core files and spit out backtraces +for corefile in $(find /tmp/ -name '*.core' 2>/dev/null) ; do + binary=$(gdb -quiet -core $corefile -batch -ex 'info auxv' | grep AT_EXECFN | perl -pe "s/^.*\"(.*)\"\$/\$1/g") + echo dumping $corefile for $binary + gdb --batch --quiet -ex "thread apply all bt full" -ex "quit" $binary $corefile +done +- name: Documentation + run: make -s docs diff --git a/.github/workflows/ci-macos.yml b/.github/workflows/ci-macos.yml new file mode 100644 index 00..2c20a5a279 --- /dev/null +++ b/.github/workflows/ci-
Re: Commitfest 2020-11 is closed
On Thu, Dec 3, 2020 at 12:02 PM Josef Šimánek wrote: > st 2. 12. 2020 v 23:36 odesílatel Andrew Dunstan napsal: > > On 12/2/20 5:13 PM, Thomas Munro wrote: > > > I'm experimenting with Github's built in CI. All other ideas welcome. > > > > I'd look very closely at gitlab. > > I was about to respond before with the same idea. Feel free to ping > regarding another CI. Also there is possibility to move to GitHub > Actions (free open source CI). I got some experience with that as > well. I spent today getting something working on Github just to try it out, and started a new thread[1] about that. I've not tried Gitlab and have no opinion about that; if someone has a working patch similar to that, I'd definitely be interested to take a look. I've looked at some others. For what cfbot is doing (namely: concentrating the work of hundreds into one "account" via hundreds of branches), the spin-up times and concurrency limits are a bit of a problem on many of them. FWIW I think they're all wonderful for offering this service to open source projects and I'm grateful that they do it! [1] https://www.postgresql.org/message-id/flat/CA%2BhUKG%2By_SHVQcU3CPokmJxuHp1niebCjq4XzZizf8SR9ZdQRQ%40mail.gmail.com
Re: SELECT INTO deprecation
Stephen Frost wrote on 02.12.2020 at 18:58: > We should either remove it, or remove the comments that it's deprecated, > not try to make it more deprecated or try to somehow increase the > recommendation to not use it. (I am writing from a "user only" perspective, not a developer) I don't see any warning about the syntax being "deprecated" in the current manual. There is only a note that says that CTAS is "recommended" instead of SELECT INTO, but for me that's something entirely different than "deprecating" it. I personally have nothing against removing it, but I still see it used a lot in questions on various online forums, and I would think that a lot of people would be very unpleasantly surprised if a feature gets removed without any warning (the current "recommendation" does not constitute a deprecation or even removal warning for most people I guess) I would vote for a clear deprecation message as suggested by Peter, but I would add "and will be removed in a future version" to it. Not sure if maybe even back-patching that warning would make sense as well, so that users of older versions also get to see that warning. Then target 15 or 16 as the release for removal, but not 14. Thomas
Re: Github Actions (CI)
On Fri, Dec 4, 2020 at 2:55 AM Josef Šimánek wrote: > Any chance to also share links to failing/passing testing builds? https://github.com/macdice/postgres/runs/1490727809 https://github.com/macdice/postgres/runs/1490727838 However, looking these in a clean/cookieless browser, I see that it won't show you anything useful unless you're currently logged in to Github, so that's a major point against using it for cfbot (it's certainly not my intention to make cfbot only useful to people who have Github accounts).
Re: Index Skip Scan (new UniqueKeys)
On Wed, Dec 2, 2020 at 9:59 AM Heikki Linnakangas wrote: > On 01/12/2020 22:21, Dmitry Dolgov wrote: > >> On Mon, Nov 30, 2020 at 04:42:20PM +0200, Heikki Linnakangas wrote: > >> - I'm surprised you need a new index AM function (amskip) for this. Can't > >> you just restart the scan with index_rescan()? The btree AM can check if > >> the > >> new keys are on the same page, and optimize the rescan accordingly, like > >> amskip does. That would speed up e.g. nested loop scans too, where the keys > >> just happen to be clustered. > > > > An interesting point. At the moment I'm not sure whether it's possible > > to implement skipping via index_rescan or not, need to take a look. But > > checking if the new keys are on the same page would introduce some > > overhead I guess, wouldn't it be too invasive to add it into already > > existing btree AM? > > I think it'll be OK. But if it's not, you could add a hint argument to > index_rescan() to hint the index AM that the new key is known to be > greater than the previous key. FWIW here's what I wrote about that years ago[1]: > It works by adding a new index operation 'skip' which the executor > code can use during a scan to advance to the next value (for some > prefix of the index's columns). That may be a terrible idea and > totally unnecessary... but let me explain my > reasoning: > > 1. Perhaps some index implementations can do something better than a > search for the next key value from the root. Is it possible or > desirable to use the current position as a starting point for a btree > traversal? I don't know. > > 2. It seemed that I'd need to create a new search ScanKey to use the > 'rescan' interface for skipping to the next value, but I already had > an insertion ScanKey so I wanted a way to just reuse that. But maybe > there is some other way to reuse existing index interfaces, or maybe > there is an easy way to make a new search ScanKey from the existing >insertion ScanKey? [1] https://www.postgresql.org/message-id/CADLWmXWALK8NPZqdnRQiPnrzAnic7NxYKynrkzO_vxYr8enWww%40mail.gmail.com
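For what it's worth, the hint-argument idea would presumably amount to something like the following at the interface level; purely illustrative, this signature exists nowhere and the name is invented:

    /* Existing interface: */
    void        index_rescan(IndexScanDesc scan,
                             ScanKey keys, int nkeys,
                             ScanKey orderbys, int norderbys);

    /* Hypothetical variant carrying a "keys only moved forward" hint, so
     * that e.g. btree can try the current page before searching again from
     * the root: */
    void        index_rescan_forward(IndexScanDesc scan,
                                     ScanKey keys, int nkeys,
                                     ScanKey orderbys, int norderbys,
                                     bool keys_advanced);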
Re: Blocking I/O, async I/O and io_uring
On Tue, Dec 8, 2020 at 3:56 PM Craig Ringer wrote: > I thought I'd start the discussion on this and see where we can go with it. > What incremental steps can be done to move us toward parallelisable I/O > without having to redesign everything? > > I'm thinking that redo is probably a good first candidate. It doesn't depend > on the guts of the executor. It is much less sensitive to ordering between > operations in shmem and on disk since it runs in the startup process. And it > hurts REALLY BADLY from its single-threaded blocking approach to I/O - as > shown by an extension written by 2ndQuadrant that can double redo performance > by doing read-ahead on btree pages that will soon be needed. About the redo suggestion: https://commitfest.postgresql.org/31/2410/ does exactly that! It currently uses POSIX_FADV_WILLNEED because that's what PrefetchSharedBuffer() does, but when combined with a "real AIO" patch set (see earlier threads and conference talks on this by Andres) and a few small tweaks to control batching of I/O submissions, it does exactly what you're describing. I tried to keep the WAL prefetcher project entirely disentangled from the core AIO work, though, hence the "poor man's AIO" for now.
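Concretely, the "poor man's AIO" there is just the advice call below, issued for blocks that upcoming WAL records reference; a minimal standalone sketch rather than the actual PrefetchSharedBuffer() path:

    #include <fcntl.h>

    /*
     * Ask the kernel to start reading a range we'll need soon, without
     * blocking; a later pread() of the same range should then hit cache.
     */
    static void
    hint_future_read(int fd, off_t offset, off_t len)
    {
    #ifdef POSIX_FADV_WILLNEED
        (void) posix_fadvise(fd, offset, len, POSIX_FADV_WILLNEED);
    #endif
    }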
Re: Commitfest 2020-11 is closed
On Fri, Dec 4, 2020 at 1:29 AM Peter Eisentraut wrote: > On 2020-12-02 23:13, Thomas Munro wrote: > > I'm experimenting with Github's built in CI. All other ideas welcome. > > You can run Linux builds on AppVeyor, too. Yeah. This would be the easiest change to make, because I already have configuration files, cfbot already knows how to talk to Appveyor to collect results, and the build results are public URLs. So I'm leaning towards this option in the short term, so that cfbot keeps working after December 31. We can always review the options later. Appveyor's free-for-open-source plan has no cap on minutes, but limits concurrent jobs to 1. Gitlab's free-for-open-source plan is based on metered CI minutes, but cfbot is a bit too greedy for the advertised limits. An annual allotment of 50,000 minutes would run out some time in February assuming we rebase each of ~250 branches every few days. GitHub Actions has very generous resource limits, nice UX and API, etc etc, so it's tempting, but its build log URLs can only be accessed by people with accounts. That limits their usefulness when discussing test failures on our public mailing list. I thought about teaching cfbot to pull down the build logs and host them on its own web server, but that feels like going against the grain.
Re: Autovacuum worker doesn't immediately exit on postmaster death
On Fri, Dec 11, 2020 at 8:34 AM Robert Haas wrote: > On Thu, Oct 29, 2020 at 5:36 PM Alvaro Herrera > wrote: > > Maybe instead of thinking specifically in terms of vacuum, we could > > count buffer accesses (read from kernel) and check the latch once every > > 1000th such, or something like that. Then a very long query doesn't > > have to wait until it's run to completion. The cost is one integer > > addition per syscall, which should be bearable. > > Interesting idea. One related case is where everything is fine on the > server side but the client has disconnected and we don't notice that > the socket has changed state until something makes us try to send a > message to the client, which might be a really long time if the > server's doing like a lengthy computation before generating any rows. > It would be really nice if we could find a cheap way to check for both > postmaster death and client disconnect every now and then, like if a > single system call could somehow answer both questions. For the record, an alternative approach was proposed[1] that periodically checks for disconnected sockets using a timer, which will then cause the next CFI() to abort. Doing the check (a syscall) based on elapsed time rather than every nth CFI() or buffer access or whatever seems better in some ways, considering the difficulty of knowing what the frequency will be. One of the objections was that it added unacceptable setitimer() calls. We discussed an idea to solve that problem generally, and then later I prototyped that idea in another thread[2] about idle session timeouts (not sure about that yet, comments welcome). I've also wondered about checking postmaster_possibly_dead in CFI() on platforms where we have it (and working to increase that set of platforms), instead of just reacting to PM death when sleeping. But it seems like the real problem in this specific case is the use of pg_usleep() where WaitLatch() should be used, no? The recovery loop is at the opposite end of the spectrum: while vacuum doesn't check for postmaster death often enough, the recovery loop checks potentially hundreds of thousands or millions of times per second, which sucks on systems that don't have parent-death signals and slows down recovery quite measurably. In the course of the discussion about fixing that[3] we spotted other places that are using a pg_usleep() where they ought to be using WaitLatch() (which comes with exit-on-PM-death behaviour built-in). By the way, the patch in that thread does almost what Robert described, namely check for PM death every nth time (which in this case means every nth WAL record), except it's not in the main CFI(), it's in a special variant used just for recovery. [1] https://www.postgresql.org/message-id/flat/77def86b27e41f0efcba411460e929ae%40postgrespro.ru [2] https://www.postgresql.org/message-id/flat/763a0689-f189-459e-946f-f0ec44589...@hotmail.com [3] https://www.postgresql.org/message-id/flat/CA+hUKGK1607VmtrDUHQXrsooU=ap4g4r2yaobywooa3m8xe...@mail.gmail.com
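Alvaro's counting idea quoted at the top would amount to something like this at the read-from-kernel path; a sketch only, with the threshold and where exactly it would live invented for illustration:

    static uint64 reads_since_check = 0;

    static inline void
    CheckPostmasterEveryNthRead(void)
    {
        /* one integer addition per read; a real check only every 1000th */
        if (++reads_since_check >= 1000)
        {
            reads_since_check = 0;
            if (!PostmasterIsAlive())
                proc_exit(1);
        }
    }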
Re: Cache relation sizes?
Hi Andy, On Thu, Dec 17, 2020 at 7:29 PM Andy Fan wrote: > I spent one day studying the patch and I want to talk about one question for > now. > What is the purpose of calling smgrimmedsync to evict a DIRTY sr (what will > happen > if we remove it and the SR_SYNCING and SR_JUST_DIRTIED flags)? The underlying theory of the patch is that after fsync() succeeds, the new size is permanent, but before that it could change at any time because of asynchronous activities in the kernel*. Therefore, if you want to evict the size from shared memory, you have to fsync() the file first. I used smgrimmedsync() to do that. *I suspect that it can really only shrink with network file systems such as NFS and GlusterFS, where the kernel has to take liberties to avoid network round trips for every file extension, but I don't know the details on all 9 of our supported OSes and standards aren't very helpful for understanding buffered I/O.
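In other words, the eviction path has to follow roughly this rule (pseudo-code sketch with made-up helper names, not the patch's actual functions):

/* Sketch only: evicting a shared relation-size entry safely. */
static void
evict_cached_size(SMgrRelation reln, ForkNumber forknum)
{
    if (cached_size_is_dirty(reln, forknum))    /* hypothetical helper */
        smgrimmedsync(reln, forknum);           /* size is durable only after fsync() succeeds */

    /* Now the kernel has promised not to forget the new size. */
    forget_cached_size(reln, forknum);          /* hypothetical helper */
}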
pg_preadv() and pg_pwritev()
Hello hackers, I want to be able to do synchronous vectored file I/O, so I made wrapper macros for preadv() and pwritev() with fallbacks for systems that don't have them. Following the precedent of the pg_pread() and pg_pwrite() macros, the "pg_" prefix reflects a subtle contract change: the fallback paths might have the side effect of changing the file position. They're non-standard system calls, but the BSDs and Linux have had them for a long time, and for other systems we can use POSIX readv()/writev() with an additional lseek(). The worst case is Windows (and maybe our favourite antique Unix build farm animal?) which has none of those things, so there is a further fallback to a loop. Windows does have ReadFileScatter() and WriteFileGather(), but those only work for overlapped (= asynchronous), unbuffered, page aligned access. They'll very likely be useful for native AIO+DIO support in the future, but don't fit the bill here. This is part of a project to consolidate and offload I/O (about which more soon), but seemed isolated enough to post separately and I guess it could be independently useful. From 4f4e26d0e2c023d62865365e3cb5ce1c89e63cb0 Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Thu, 26 Nov 2020 15:48:31 +1300 Subject: [PATCH 07/11] Add pg_preadv() and pg_pwritev(). Provide synchronous scatter/gather I/O routines. These map to preadv(), pwritev() with various fallbacks for systems that don't have them. --- configure | 30 ++-- configure.ac | 9 +++-- src/include/pg_config.h.in | 15 ++ src/include/port.h | 37 ++ src/port/Makefile | 2 ++ src/port/pread.c | 41 -- src/port/pwrite.c | 41 -- src/tools/msvc/Solution.pm | 5 + 8 files changed, 146 insertions(+), 34 deletions(-) diff --git a/configure b/configure index cc2089e596..d25b8f2900 100755 --- a/configure +++ b/configure @@ -13199,7 +13199,7 @@ $as_echo "#define HAVE_STDBOOL_H 1" >>confdefs.h fi -for ac_header in atomic.h copyfile.h execinfo.h getopt.h ifaddrs.h langinfo.h mbarrier.h poll.h sys/epoll.h sys/event.h sys/ipc.h sys/prctl.h sys/procctl.h sys/pstat.h sys/resource.h sys/select.h sys/sem.h sys/shm.h sys/sockio.h sys/tas.h sys/un.h termios.h ucred.h wctype.h +for ac_header in atomic.h copyfile.h execinfo.h getopt.h ifaddrs.h langinfo.h mbarrier.h poll.h sys/epoll.h sys/event.h sys/ipc.h sys/prctl.h sys/procctl.h sys/pstat.h sys/resource.h sys/select.h sys/sem.h sys/shm.h sys/sockio.h sys/tas.h sys/uio.h sys/un.h termios.h ucred.h wctype.h do : as_ac_Header=`$as_echo "ac_cv_header_$ac_header" | $as_tr_sh` ac_fn_c_check_header_mongrel "$LINENO" "$ac_header" "$as_ac_Header" "$ac_includes_default" @@ -15293,7 +15293,7 @@ fi LIBS_including_readline="$LIBS" LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'` -for ac_func in backtrace_symbols clock_gettime copyfile fdatasync getifaddrs getpeerucred getrlimit kqueue mbstowcs_l memset_s poll posix_fallocate ppoll pstat pthread_is_threaded_np readlink setproctitle setproctitle_fast setsid shm_open strchrnul strsignal symlink sync_file_range uselocale wcstombs_l +for ac_func in backtrace_symbols clock_gettime copyfile fdatasync getifaddrs getpeerucred getrlimit kqueue mbstowcs_l memset_s poll posix_fallocate ppoll pread preadv pstat pthread_is_threaded_np pwrite pwritev readlink readv setproctitle setproctitle_fast setsid shm_open strchrnul strsignal symlink sync_file_range uselocale wcstombs_l writev do : as_ac_var=`$as_echo "ac_cv_func_$ac_func" | $as_tr_sh` ac_fn_c_check_func "$LINENO" "$ac_func" "$as_ac_var" @@ -15970,32 +15970,6 @@ esac fi -ac_fn_c_check_func 
"$LINENO" "pread" "ac_cv_func_pread" -if test "x$ac_cv_func_pread" = xyes; then : - $as_echo "#define HAVE_PREAD 1" >>confdefs.h - -else - case " $LIBOBJS " in - *" pread.$ac_objext "* ) ;; - *) LIBOBJS="$LIBOBJS pread.$ac_objext" - ;; -esac - -fi - -ac_fn_c_check_func "$LINENO" "pwrite" "ac_cv_func_pwrite" -if test "x$ac_cv_func_pwrite" = xyes; then : - $as_echo "#define HAVE_PWRITE 1" >>confdefs.h - -else - case " $LIBOBJS " in - *" pwrite.$ac_objext "* ) ;; - *) LIBOBJS="$LIBOBJS pwrite.$ac_objext" - ;; -esac - -fi - ac_fn_c_check_func "$LINENO" "random" "ac_cv_func_random" if test "x$ac_cv_func_random" = xyes; then : $as_echo "#define HAVE_RANDOM 1" >>confdefs.h diff --git a/configure.ac b/configure.ac index c819aeab26..9c21e46ec8 100644 ---
Re: pg_preadv() and pg_pwritev()
On Sun, Dec 20, 2020 at 12:34 PM Tom Lane wrote: > Thomas Munro writes: > > I want to be able to do synchronous vectored file I/O, so I made > > wrapper macros for preadv() and pwritev() with fallbacks for systems > > that don't have them. Following the precedent of the pg_pread() and > > pg_pwrite() macros, the "pg_" prefix reflects a subtle contract > > change: the fallback paths might have the side effect of changing the > > file position. > > In a quick look, seems OK with some nits: Thanks for looking! > 1. port.h cannot assume that has already been included; > nor do I want to fix that by including there. Do we > really need to define a fallback value of IOV_MAX? If so, > maybe the answer is to put the replacement struct iovec and > IOV_MAX in some new header. Ok, I moved all this stuff into port/pg_uio.h. > 2. I'm not really that happy about loading into > every compilation we do, which would be another reason for a > new specialized header that either includes or > provides fallback definitions. Ack. > 3. The patch as given won't prove anything except that the code > compiles. Is it worth fixing at least one code path to make > use of pg_preadv and pg_pwritev, so we can make sure this code > is tested before there's layers of other new code on top? OK, here's a patch to zero-fill fresh WAL segments with pwritev(). I'm drawing a blank on trivial candidate uses for preadv(), without infrastructure from later patches. From d7178f296642e177bc57fabe93abec395cbaac5f Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Thu, 26 Nov 2020 15:48:31 +1300 Subject: [PATCH v2 1/2] Add pg_preadv() and pg_pwritev(). Provide synchronous scatter/gather I/O routines. These map to preadv(), pwritev() with various fallbacks for systems that don't have them. Reviewed-by: Tom Lane Discussion: https://postgr.es/m/CA%2BhUKGJA%2Bu-220VONeoREBXJ9P3S94Y7J%2BkqCnTYmahvZJwM%3Dg%40mail.gmail.com --- configure | 30 ++ configure.ac | 9 +-- src/include/pg_config.h.in | 15 +++ src/include/port.h | 2 ++ src/include/port/pg_uio.h | 51 ++ src/port/Makefile | 2 ++ src/port/pread.c | 43 ++-- src/port/pwrite.c | 43 ++-- src/tools/msvc/Solution.pm | 5 9 files changed, 166 insertions(+), 34 deletions(-) create mode 100644 src/include/port/pg_uio.h diff --git a/configure b/configure index 11a4284e5b..5887c712cc 100755 --- a/configure +++ b/configure @@ -13061,7 +13061,7 @@ $as_echo "#define HAVE_STDBOOL_H 1" >>confdefs.h fi -for ac_header in atomic.h copyfile.h execinfo.h getopt.h ifaddrs.h langinfo.h mbarrier.h poll.h sys/epoll.h sys/event.h sys/ipc.h sys/prctl.h sys/procctl.h sys/pstat.h sys/resource.h sys/select.h sys/sem.h sys/shm.h sys/sockio.h sys/tas.h sys/un.h termios.h ucred.h wctype.h +for ac_header in atomic.h copyfile.h execinfo.h getopt.h ifaddrs.h langinfo.h mbarrier.h poll.h sys/epoll.h sys/event.h sys/ipc.h sys/prctl.h sys/procctl.h sys/pstat.h sys/resource.h sys/select.h sys/sem.h sys/shm.h sys/sockio.h sys/tas.h sys/uio.h sys/un.h termios.h ucred.h wctype.h do : as_ac_Header=`$as_echo "ac_cv_header_$ac_header" | $as_tr_sh` ac_fn_c_check_header_mongrel "$LINENO" "$ac_header" "$as_ac_Header" "$ac_includes_default" @@ -15155,7 +15155,7 @@ fi LIBS_including_readline="$LIBS" LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'` -for ac_func in backtrace_symbols clock_gettime copyfile fdatasync getifaddrs getpeerucred getrlimit kqueue mbstowcs_l memset_s poll posix_fallocate ppoll pstat pthread_is_threaded_np readlink setproctitle setproctitle_fast setsid shm_open strchrnul strsignal symlink 
sync_file_range uselocale wcstombs_l +for ac_func in backtrace_symbols clock_gettime copyfile fdatasync getifaddrs getpeerucred getrlimit kqueue mbstowcs_l memset_s poll posix_fallocate ppoll pread preadv pstat pthread_is_threaded_np pwrite pwritev readlink readv setproctitle setproctitle_fast setsid shm_open strchrnul strsignal symlink sync_file_range uselocale wcstombs_l writev do : as_ac_var=`$as_echo "ac_cv_func_$ac_func" | $as_tr_sh` ac_fn_c_check_func "$LINENO" "$ac_func" "$as_ac_var" @@ -15832,32 +15832,6 @@ esac fi -ac_fn_c_check_func "$LINENO" "pread" "ac_cv_func_pread" -if test "x$ac_cv_func_pread" = xyes; then : - $as_echo "#define HAVE_PREAD 1" >>confdefs.h - -else - case " $LIBOBJS " in - *" pread.$ac_objext "* ) ;; - *) LIBOBJS="$LIBOBJS pread.$ac_objext" - ;; -esac - -fi - -ac_fn_c_check_func &
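For anyone skimming past the attachment: the zero-fill change amounts to roughly this (paraphrased sketch, error handling elided; see the patch for the real thing):

char        zbuffer[XLOG_BLCKSZ] = {0};
struct iovec iov[Min(IOV_MAX, 1024)];    /* cap stack space */
int         blocks = wal_segment_size / XLOG_BLCKSZ;

/* Every iovec points at the same zeroed block. */
for (int i = 0; i < (int) lengthof(iov); ++i)
{
    iov[i].iov_base = zbuffer;
    iov[i].iov_len = XLOG_BLCKSZ;
}

for (int block = 0; block < blocks;)
{
    int         iovcnt = Min(blocks - block, (int) lengthof(iov));
    off_t       offset = (off_t) block * XLOG_BLCKSZ;

    if (pg_pwritev(fd, iov, iovcnt, offset) < 0)
        break;                  /* caller reports errno and unlinks the file */
    block += iovcnt;
}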
Re: pg_preadv() and pg_pwritev()
On Sun, Dec 20, 2020 at 8:07 PM Tom Lane wrote: > One minor thought is that in > > + struct iovec iov[Min(IOV_MAX, 1024)]; /* cap stack space */ > > it seems like pretty much every use of IOV_MAX would want some > similar cap. Should we centralize that idea with, say, > > #define PG_IOV_MAX Min(IOV_MAX, 1024) > > ? Or will the plausible cap vary across uses? Hmm. For the real intended user of this, namely worker processes that simulate AIO when native AIO isn't available, higher level code will limit the iov count to much smaller numbers anyway. It wants to try to stay under typical device limits for vectored I/O, because split requests would confound attempts to model and limit queue depth and control latency. In Andres's AIO prototype he currently has a macro PGAIO_MAX_COMBINE set to 16 (meaning approximately 16 data block or wal reads/writes = 128KB worth of scatter/gather per I/O request); I guess it should really be Min(IOV_MAX, ), but I don't currently have an opinion on the , except that it should surely be closer to 16 than 1024 (for example /sys/block/nvme0n1/queue/max_segments is 33 here). I mention all this to explain that I don't think the code in patch 0002 is going to turn out to be very typical: it's trying to minimise system calls by staying under an API limit (though I cap it for allocation sanity), whereas more typical code probably wants to stay under a device limit, so I don't immediately have another use for eg PG_IOV_MAX.
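For reference, the centralised version being suggested would presumably look something like this in a port header (sketch only; the cap value is the debatable part):

#ifndef IOV_MAX
#define IOV_MAX 16    /* fallback; POSIX guarantees at least _XOPEN_IOV_MAX = 16 */
#endif

/* A stack-friendly cap for local iovec arrays. */
#define PG_IOV_MAX Min(IOV_MAX, 32)    /* or 1024, or whatever cap we settle on */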
Re: pg_preadv() and pg_pwritev()
On Mon, Dec 21, 2020 at 8:25 PM Jakub Wartak wrote: > > > I'm drawing a blank on trivial candidate uses for preadv(), without > > > infrastructure from later patches. > > > > Can't immediately think of something either. > > This might be not that trivial , but maybe acquire_sample_rows() from > analyze.c ? > > Please note however there's patch > https://www.postgresql.org/message-id/20201109180644.GJ16415%40tamriel.snowman.net > ( https://commitfest.postgresql.org/30/2799/ ) for adding fadvise, but maybe > those two could be even combined so you would be doing e.g. 16x fadvise() and > then grab 8 pages in one preadv() call ? I'm find unlikely however that > preadv give any additional performance benefit there after having fadvise, > but clearly a potential place to test. Oh, interesting, that looks like another test case to study with the AIO patch set, but I don't think it's worth trying to do a simpler/half-baked version in the meantime. (Since that ANALYZE patch uses PrefetchBuffer() it should automatically benefit: the posix_fadvise() calls will be replaced with consolidated preadv() calls in a worker process or native AIO equivalent so that system calls are mostly gone from the initiating process, and by the time you try to access the buffer it'll hopefully see that it's finished without any further system calls. Refinements are possible though, like making use of recent_buffer to avoid double-lookup, and tuning/optimisation for how often IOs should be consolidated and submitted.)
Re: pg_preadv() and pg_pwritev()
On Mon, Dec 21, 2020 at 11:40 AM Andres Freund wrote: > On 2020-12-20 16:26:42 +1300, Thomas Munro wrote: > > > 1. port.h cannot assume that has already been included; > > > nor do I want to fix that by including there. Do we > > > really need to define a fallback value of IOV_MAX? If so, > > > maybe the answer is to put the replacement struct iovec and > > > IOV_MAX in some new header. > > > > Ok, I moved all this stuff into port/pg_uio.h. > > Can we come up with a better name than 'uio'? I find that a not exactly > meaningful name. Ok, let's try port/pg_iovec.h. > Or perhaps we could just leave the functions in port/port.h, but extract > the value of the define at configure time? That way only pread.c / > pwrite.c would need it. That won't work for the struct definition, so client code would need to remember to do: #ifdef HAVE_SYS_UIO_H #include #endif ... which is a little tedious, or port.h would need to do that and incur an overhead in every translation unit, which Tom objected to. That's why I liked the separate header idea. > > > 3. The patch as given won't prove anything except that the code > > > compiles. Is it worth fixing at least one code path to make > > > use of pg_preadv and pg_pwritev, so we can make sure this code > > > is tested before there's layers of other new code on top? > > > > OK, here's a patch to zero-fill fresh WAL segments with pwritev(). > > I think that's a good idea. However, I see two small issues: 1) If we > write larger amounts at once, we need to handle partial writes. That's a > large enough amount of IO that it's much more likely to hit a memory > shortage or such in the kernel - we had to do that a while a go for WAL > writes (which can also be larger), if memory serves. > > Perhaps we should have pg_pwritev/readv unconditionally go through > pwrite.c/pread.c and add support for "continuing" partial writes/reads > in one central place? Ok, here's a new version with the following changes: 1. Define PG_IOV_MAX, a reasonable size for local variables, not larger than IOV_MAX. 2 Use 32 rather than 1024, based on off-list complaint about 1024 potentially swamping the IO system unfairly. 3. Add a wrapper pg_pwritev_retry() to retry automatically on short writes. (I didn't write pg_preadv_retry(), because I don't currently need it for anything so I didn't want to have to think about EOF vs short-reads-for-implementation-reasons.) 4. I considered whether pg_pwrite() should have built-in retry instead of a separate wrapper, but I thought of an argument against hiding the "raw" version: the AIO patch set already understands short reads/writes and knows how to retry at a higher level, as that's needed for native AIO too, so I think it makes sense to be able to keep things the same and not solve the same problem twice. A counter argument would be that you could get the retry underway sooner with a tight loop, but I'm not expecting this to be common. > > I'm drawing a blank on trivial candidate uses for preadv(), without > > infrastructure from later patches. > > Can't immediately think of something either. One idea I had for the future is for xlogreader.c to read the WAL into a large multi-page circular buffer, which could wrap around using a pair of iovecs, but that requires lots more work. From 3da0b25b4ce0804355f912bd026ef9c9eee146f3 Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Thu, 26 Nov 2020 15:48:31 +1300 Subject: [PATCH v3 1/2] Add pg_preadv() and pg_pwritev(). Provide synchronous scatter/gather I/O routines. 
These map to preadv(), pwritev() with various fallbacks for systems that don't have them. Also provide a wrapper pg_pwritev_retry() which automatically retries on short write. Reviewed-by: Tom Lane Reviewed-by: Andres Freund Discussion: https://postgr.es/m/CA%2BhUKGJA%2Bu-220VONeoREBXJ9P3S94Y7J%2BkqCnTYmahvZJwM%3Dg%40mail.gmail.com --- configure | 30 +- configure.ac| 9 ++- src/include/pg_config.h.in | 15 + src/include/port.h | 2 + src/include/port/pg_iovec.h | 57 +++ src/port/Makefile | 2 + src/port/pread.c| 43 +- src/port/pwrite.c | 109 +++- src/tools/msvc/Solution.pm | 5 ++ 9 files changed, 238 insertions(+), 34 deletions(-) create mode 100644 src/include/port/pg_iovec.h diff --git a/configure b/configure index 11a4284e5b..5887c712cc 100755 --- a/configure +++ b/configure @@ -13061,7 +13061,7 @@ $as_echo "#define HAVE_STDBOOL_H 1" >>confdefs.h fi -for ac_header in atomic.h copyfile.h execinfo.h getopt.h ifaddrs.h langinfo.h mbarrier.h poll.h sys
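For reviewers who don't want to open the attachment, the retry logic in the new wrapper amounts to roughly this (condensed sketch, not the exact patch text; the zero-length-write case is glossed over):

ssize_t
pg_pwritev_retry(int fd, const struct iovec *iov, int iovcnt, off_t offset)
{
    struct iovec iov_copy[PG_IOV_MAX];
    ssize_t     sum = 0;

    if (iovcnt > PG_IOV_MAX)
    {
        errno = EINVAL;
        return -1;
    }

    for (;;)
    {
        ssize_t     part = pg_pwritev(fd, iov, iovcnt, offset);

        if (part < 0)
            return -1;
        sum += part;
        offset += part;

        /* Step over iovecs that were written out completely. */
        while (iovcnt > 0 && (size_t) part >= iov->iov_len)
        {
            part -= iov->iov_len;
            ++iov;
            --iovcnt;
        }
        if (iovcnt == 0)
            return sum;         /* all done */

        /* Adjust the partially written iovec in a copy we're allowed to modify. */
        memmove(iov_copy, iov, sizeof(struct iovec) * iovcnt);
        iov_copy[0].iov_base = (char *) iov_copy[0].iov_base + part;
        iov_copy[0].iov_len -= part;
        iov = iov_copy;
    }
}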
Re: Cache relation sizes?
On Thu, Dec 17, 2020 at 10:22 PM Andy Fan wrote: > Let me try to understand your point. Suppose process 1 extends a file to > 2 blocks from 1 block, and fsync is not called, then a). the lseek *may* still > return 1 based on the comments in the ReadBuffer_common ("because > of buggy Linux kernels that sometimes return an seek SEEK_END result > that doesn't account for a recent write"). b). When this patch is introduced, > we would get 2 blocks for sure if we can get the size from cache. c). After > user reads the 2 blocks from cache and then the cache is evicted, and user > reads the blocks again, it is possible to get 1 again. > > Do we need to consider case a) in this patch? and Is case c). the situation > you > want to avoid and do we have to introduce fsync to archive this? Basically I > think case a) should not happen by practice so we don't need to handle case > c) and finally we don't need the fsync/SR_SYNCING/SR_JUST_DIRTIED flag. > I'm not opposed to adding them, I am just interested in the background in case > you are also willing to discuss it. Sorry for the lack of background -- there was some discussion on the thread "Optimize dropping of relation buffers using dlist", but it's long and mixed up with other topics. I'll try to summarise here. It's easy to reproduce files shrinking asynchronously on network filesystems that reserve space lazily[1], and we know that there have historically been problems even with local filesystems[2], leading to that comment about buggy kernels. I am only interested in the behaviour I can demonstrate, because I believe Linux is working as intended when it does that (hypothetical/unknown bugs would probably be helped by this approach too, but I'm not interested in speculating about that beyond these parentheses). So, why should we consider this? It's true that commit ffae5cc5a60's change to ReadBuffer_common() already prevents us from eating data by overwriting an existing buffer due to this phenomenon, but there are other problems: 1. A system that in this state can give an incorrect answer to a query: heapam.c calls RelationGetNumberOfBlocks() at the beginning of a scan, so it might not see recently added blocks containing committed data. Hopefully we'll panic when we try to create a checkpoint and see errors, but we could be giving bad results for a while. 2. Commitfest entry #2319 needs an accurate size, or it might leave stray blocks in the buffer pool that will cause failures or corruption later. It's true that we could decide not to worry about this, and perhaps even acknowledge in some official way that PostgreSQL doesn't work reliably on some kinds of filesystems. But in this prototype I thought it was fun to try to fix our very high rate of lseek() syscalls and this reliability problem, at the same time. I also speculate that we'll eventually have other reasons to want a demand-paged per-relation shared memory object. > I would suggest the below change for smgrnblock_fast, FYI > > @@ -182,7 +182,11 @@ smgrnblocks_fast(SMgrRelation reln, ForkNumber forknum) > /* > * The generation doesn't match, the shared relation must > have been > * evicted since we got a pointer to it. We'll need to do > more work. > +* and invalid the fast path for next time. > */ > + reln->smgr_shared = NULL; > } Thanks, makes sense -- I'll add this to the next version. 
[1] https://www.postgresql.org/message-id/CA%2BhUKGLZTfKuXir9U4JQkY%3Dk3Tb6M_3toGrGOK_fa2d4MPQQ_w%40mail.gmail.com [2] https://www.postgresql.org/message-id/BAY116-F146A38326C5630EA33B36BD12B0%40phx.gbl
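To spell out the fast path under discussion, it looks approximately like this (field and helper names are approximations, see the patch for the real thing):

BlockNumber
smgrnblocks_fast(SMgrRelation reln, ForkNumber forknum)
{
    SMgrSharedRelation *sr = reln->smgr_shared;

    if (sr != NULL)
    {
        pg_read_barrier();

        /* Only trust the cached value if the SR hasn't been recycled. */
        if (reln->smgr_shared_generation == sr->generation)
            return sr->nblocks[forknum];

        /*
         * The generation doesn't match: the SR was evicted since we cached
         * the pointer, so invalidate the fast path for next time.
         */
        reln->smgr_shared = NULL;
    }
    return smgrnblocks_slow(reln, forknum);    /* slow path: mapping table, maybe lseek() */
}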
Re: Re: Cache relation sizes?
On Wed, Dec 23, 2020 at 1:31 AM 陈佳昕(步真) wrote: > I studied your patch these days and found there might be a problem. > When executing 'drop database', the smgr shared pool will not be removed > because 'smgr_drop_sr' is never called. Function 'dropdb' in dbcommands.c removes > the buffers from the buffer pool and unlinks the real files with 'rmtree'; it doesn't > call smgrdounlinkall, so the smgr shared cache will not be dropped although > the table has been removed. This will cause some errors in smgr_alloc_sr > -> smgropen, smgrimmedsync. The table file has been removed, so smgropen and > smgrimmedsync will get an unexpected result. Hi Buzhen, Thanks, you're right -- it needs to scan the pool of SRs and forget everything from the database you're dropping.

Re: doc review for v14
On Thu, Dec 24, 2020 at 9:12 PM Michael Paquier wrote: > On Mon, Dec 21, 2020 at 10:11:53PM -0600, Justin Pryzby wrote: > > Specifies the amount of memory that should be allocated at server > > -startup time for use by parallel queries. When this memory region > > is > > +startup for use by parallel queries. When this memory region is > > insufficient or exhausted by concurrent queries, new parallel > > queries > > try to allocate extra shared memory temporarily from the operating > > system using the method configured with > > dynamic_shared_memory_type, which may be slower > > due > > to memory management overheads. Memory that is allocated at > > startup > > -time with min_dynamic_shared_memory is affected > > by > > +with min_dynamic_shared_memory is affected by > > the huge_pages setting on operating systems > > where > > that is supported, and may be more likely to benefit from larger > > pages > > on operating systems where that is managed automatically. > > The current formulation is not that confusing, but I agree that this > is an improvement. Thomas, you are behind this one. What do you > think? LGTM.
Re: Parallel Full Hash Join
On Mon, Dec 28, 2020 at 9:49 PM Masahiko Sawada wrote: > On Thu, Nov 5, 2020 at 7:34 AM Melanie Plageman > wrote: > > I've attached a patch with the corrections I mentioned upthread. > > I've gone ahead and run pgindent, though, I can't say that I'm very > > happy with the result. > > > > I'm still not quite happy with the name > > BarrierArriveAndDetachExceptLast(). It's so literal. As you said, there > > probably isn't a nice name for this concept, since it is a function with > > the purpose of terminating parallelism. > > You sent in your patch, v3-0001-Support-Parallel-FOJ-and-ROJ.patch to > pgsql-hackers on Nov 5, but you did not post it to the next > CommitFest[1]. If this was intentional, then you need to take no > action. However, if you want your patch to be reviewed as part of the > upcoming CommitFest, then you need to add it yourself before > 2021-01-01 AOE[2]. Also, rebasing to the current HEAD may be required > as almost two months passed since when this patch is submitted. Thanks > for your contributions. Thanks for this reminder Sawada-san. I had some feedback I meant to post in November but didn't get around to: +bool +BarrierArriveAndDetachExceptLast(Barrier *barrier) I committed this part (7888b099). I've attached a rebase of the rest of Melanie's v3 patch. +WAIT_EVENT_HASH_BATCH_PROBE, That new wait event isn't needed (we can't and don't wait). * PHJ_BATCH_PROBING-- all probe - * PHJ_BATCH_DONE -- end + + * PHJ_BATCH_DONE -- queries not requiring inner fill done + * PHJ_BATCH_FILL_INNER_DONE -- inner fill completed, all queries done Would it be better/tidier to keep _DONE as the final phase? That is, to switch around these two final phases. Or does that make it too hard to coordinate the detach-and-cleanup logic? +/* + * ExecPrepHashTableForUnmatched + * set up for a series of ExecScanHashTableForUnmatched calls + * return true if this worker is elected to do the unmatched inner scan + */ +bool +ExecParallelPrepHashTableForUnmatched(HashJoinState *hjstate) Comment name doesn't match function name. From 9199bfcfa84acbcfeb9a8d3c21962096c7ff645c Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Thu, 5 Nov 2020 16:20:24 +1300 Subject: [PATCH v4 1/2] Parallel Hash Full/Right Join. Previously we allowed PHJ only for inner and left outer joins. Share the hash table match bits between backends, so we can also also handle full and right joins. In order to do that without introducing any deadlock risks, for now we drop down to a single process for the unmatched scan at the end of each batch. Other processes detach and look for another batch to help with. If there aren't any more batches, they'll finish the hash join early, making work distribution suboptimal. Improving that might require bigger executor changes. Also switch the unmatched tuple scan to work in memory-chunk order, rather than bucket order. This prepares for potential later improvements that would use chunks as parallel grain, and seems to be a little cache-friendlier than the bucket-scan scheme scheme in the meantime. 
Author: Melanie Plageman Reviewed-by: Thomas Munro Discussion: https://postgr.es/m/CA%2BhUKG%2BA6ftXPz4oe92%2Bx8Er%2BxpGZqto70-Q_ERwRaSyA%3DafNg%40mail.gmail.com --- src/backend/executor/nodeHash.c | 190 ++-- src/backend/executor/nodeHashjoin.c | 61 src/backend/optimizer/path/joinpath.c | 14 +- src/include/executor/hashjoin.h | 15 +- src/include/executor/nodeHash.h | 3 + src/test/regress/expected/join_hash.out | 56 ++- src/test/regress/sql/join_hash.sql | 23 ++- 7 files changed, 274 insertions(+), 88 deletions(-) diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c index ea69eeb2a1..36cc752163 100644 --- a/src/backend/executor/nodeHash.c +++ b/src/backend/executor/nodeHash.c @@ -510,6 +510,7 @@ ExecHashTableCreate(HashState *state, List *hashOperators, List *hashCollations, hashtable->spaceAllowed * SKEW_HASH_MEM_PERCENT / 100; hashtable->chunks = NULL; hashtable->current_chunk = NULL; + hashtable->current_chunk_idx = 0; hashtable->parallel_state = state->parallel_state; hashtable->area = state->ps.state->es_query_dsa; hashtable->batches = NULL; @@ -2053,9 +2054,58 @@ ExecPrepHashTableForUnmatched(HashJoinState *hjstate) * hj_CurTuple: last tuple returned, or NULL to start next bucket *-- */ + HashJoinTable hashtable = hjstate->hj_HashTable; + hjstate->hj_CurBucketNo = 0; hjstate->hj_CurSkewBucketNo = 0; hjstate->hj_CurTuple = NULL; + hashtable->current_chunk = hashtable->chunks; + hashtable->current_chunk_idx = 0; +} + +/* + * ExecPrepHashTableForUnmatched + * set up for a
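To summarise the coordination for anyone following along, the end-of-probing step in the patch reads approximately like this (abridged sketch):

Barrier    *batch_barrier =
    &hashtable->batches[hashtable->curbatch].shared->batch_barrier;

if (BarrierArriveAndDetachExceptLast(batch_barrier))
{
    /*
     * We were the last worker still attached, so we remain attached and
     * perform the unmatched inner scan alone, avoiding deadlock risk.
     */
    ExecParallelPrepHashTableForUnmatched(hjstate);
}
else
{
    /*
     * We detached: the last remaining worker will handle the unmatched
     * scan.  Move on and try to find another batch to help with, or
     * finish this join early if there are none.
     */
}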
Re: Let's start using setenv()
On Tue, Dec 29, 2020 at 4:21 PM Tom Lane wrote: > Back in 2001 we decided to prefer putenv() over setenv() because the > latter wasn't in POSIX (SUSv2) at the time. That decision has been > overtaken by events: more recent editions of POSIX not only provide > setenv() but recommend it over putenv(). So I think it's time to > change our policy and prefer setenv(). We've had previous discussions > about that but nobody did the legwork yet. > > The attached patch provides the infrastructure to allow using setenv(). > I added a src/port/ implementation of setenv() for any machines that > haven't caught up with POSIX lately. (I've tested that code on my old > dinosaur gaur, and I would not be surprised if that is the only machine > anywhere that ever runs it. But I could be wrong.) I also extended > win32env.c to support setenv(). +1, nice modernisation. > I also changed our unsetenv() emulations to make them return an int > error indicator, as per POSIX. I have no desire to change the call > sites to check for errors, but it seemed like our compatibility stubs > should be compatible with the spec. (Note: I think that unsetenv() > did return void in old BSD versions, before POSIX standardized it. > So that might be a real reason not to change the callers.) +1 for conformance. I suppose it's out of spec that unsetenv() can return -1 with errno = ENOMEM (from malloc()), but that hardly matters. Hmm... could a static buffer be used for that clobbering trick? For the note in parens: it looks like the 3 main BSDs all switched from the historical void function to the POSIX one 12-18 years ago[1][2][3], so I wouldn't worry about that. Glibc made a similar change 19 years ago. [1] https://github.com/NetBSD/src/commit/13dee93fb3a93189a74a3705a5f7cd1c6b290125 [2] https://github.com/openbsd/src/commit/471b62eeaa7f3a18c0aa98b5d605e5cec1625b62 [3] https://github.com/freebsd/freebsd/commit/196b6346ba4e13a3f7679e2de3317b6aa65983df
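For those who haven't looked at the patch yet, the src/port fallback is presumably the classic putenv()-based emulation, along these lines (sketch, not the exact patch text):

int
setenv(const char *name, const char *value, int overwrite)
{
    char       *envstr;

    if (!overwrite && getenv(name) != NULL)
        return 0;

    /* Intentionally leaked on redefinition, as with any putenv()-based emulation. */
    envstr = malloc(strlen(name) + strlen(value) + 2);
    if (envstr == NULL)
        return -1;              /* malloc set errno = ENOMEM */

    sprintf(envstr, "%s=%s", name, value);
    return putenv(envstr);
}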
Re: Let's start using setenv()
On Wed, Dec 30, 2020 at 5:45 AM Tom Lane wrote: > Since the cfbot seems happy with v1, here's a v2 that runs around > and converts all putenv() uses to setenv(). In some places there's > no notational savings, but it seems to me that standardizing > on just one way to do it is beneficial. +if (setenv("KRB5_KTNAME", pg_krb_server_keyfile, 0) != 0) { -size_tkt_len = strlen(pg_krb_server_keyfile) + 14; -char *kt_path = malloc(kt_len); - -if (!kt_path || -snprintf(kt_path, kt_len, "KRB5_KTNAME=%s", - pg_krb_server_keyfile) != kt_len - 2 || -putenv(kt_path) != 0) -{ -ereport(LOG, -(errcode(ERRCODE_OUT_OF_MEMORY), - errmsg("out of memory"))); -return STATUS_ERROR; -} +/* We assume the error must be "out of memory" */ +ereport(LOG, +(errcode(ERRCODE_OUT_OF_MEMORY), + errmsg("out of memory"))); +return STATUS_ERROR; } Wouldn't it be better to do "cannot set environment: %m" or similar, and let ENOMEM do its thing if that be the cause? It's true that POSIX only allows EINVAL or ENOMEM and we can see no reason for EINVAL here, but still it seems unnecessary to assume. -if (getenv("PGLOCALEDIR") == NULL) -{ -/* set for libpq to use */ -snprintf(env_path, sizeof(env_path), "PGLOCALEDIR=%s", path); -canonicalize_path(env_path + 12); -dup_path = strdup(env_path); -if (dup_path) -putenv(dup_path); -} +/* set for libpq to use, but don't override existing setting */ +setenv("PGLOCALEDIR", path, 0); Did you want to drop the canonicalize_path() logic here and nearby?
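To be concrete about that first hunk, the suggestion amounts to roughly this (sketch):

if (setenv("KRB5_KTNAME", pg_krb_server_keyfile, 0) != 0)
{
    /* Don't assume the cause; %m reports whatever errno setenv() left behind. */
    ereport(LOG,
            (errmsg("cannot set environment: %m")));
    return STATUS_ERROR;
}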
Re: Cache relation sizes?
On Mon, Dec 28, 2020 at 5:24 PM Andy Fan wrote: > lseek(..., SEEK_END) = 9670656 > write(...) = 8192 > lseek(..., SEEK_END) = 9678848 > fsync(...) = -1 > lseek(..., SEEK_END) = 9670656 > > I got 2 information from above. a) before the fsync, the lseek(fd, 0, > SEEK_END) > returns a correct value, however after the fsync, it returns a wrong value. > b). > the fsync failed(return values is -1) in the above case. I'm more confused > because of point a). Since the fsync can't correct some wrong value, what > is the point to call fsync in this patch (I agree that it is good to fix this > reliability problem within this patch, but I'm doubtful if fsync can help in > this case). Am I missing something obviously? The point of the fsync() call isn't to correct the value, it's to find out if it is safe to drop the SR object from shared memory because the operating system has promised that it won't forget about the new size. If that fails, we'll report an ERROR or PANIC (depending on data_sync_retry), but not drop the SR. By the way, you don't need fsync(fd) for the size to change, as shown by the other experiment in that other thread that just did sleep(60). It might also happen asynchronously. fsync(fd) forces it, but also tells you about errors. > I read your patch with details some days ago and feel it is in pretty good > shape. > and I think you are clear about the existing issues very well. like a). > smgr_alloc_sr use a > FIFO design. b). SR_PARTITION_LOCK doesn't use a partition lock really. c). > and Yeah, it probably should have its own bank of locks (I wish they were easier to add, via lwlocknames.txt or similar). > looks like you have some concern about the concurrency issue, which I didn't > find > anything wrong. d). how to handle the DIRTY sr during evict. I planned to > recheck So yeah, mostly this discussion has been about not trusting lseek() and using our own tracking of relation size instead. But there is a separate question about how "fresh" the value needs to be. The question is: under what circumstances could the unlocked read of nblocks from shared memory give you a value that isn't fresh enough for your snapshot/scan? This involves thinking about implicit memory barriers. The idea of RECENTLY_DIRTIED is similar to what we do for buffers: while you're trying to "clean" it (in this case by calling fsync()) someone else can extend the relation again, and in that case we don't know whether the new extension was included in the fsync() or not, so we can't clear the DIRTY flag when it completes. > item c) before this reply, but looks like the time is not permitted. May I > know what > Is your plan about this patch? Aside from details, bugs etc discussed in this thread, one major piece remains incomplete: something in the background to "clean" SRs so that foreground processes rarely have to block.
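The "clean while someone may re-dirty" protocol is roughly this (sketch using the patch's flag names, details simplified):

/* Mark the fork as being synced, then do the potentially slow part unlocked. */
flags = smgr_lock_sr(sr);
flags |= SR_SYNCING(forknum);
smgr_unlock_sr(sr, flags);

smgrimmedsync(reln, forknum);       /* others may extend the relation meanwhile */

flags = smgr_lock_sr(sr);
flags &= ~SR_SYNCING(forknum);
if (flags & SR_JUST_DIRTIED(forknum))
{
    /* Extended during the fsync(): must stay dirty, try again later. */
}
else
    flags &= ~SR_DIRTY(forknum);    /* provably durable: safe to evict now */
smgr_unlock_sr(sr, flags);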
Re: Re: Re: Cache relation sizes?
On Wed, Dec 30, 2020 at 4:13 AM 陈佳昕(步真) wrote: > I found some other problems which I want to share my change with you to make > you confirm. > <1> I changed the code in smgr_alloc_sr to avoid dead lock. > > LWLockAcquire(mapping_lock, LW_EXCLUSIVE); > flags = smgr_lock_sr(sr); > Assert(flags & SR_SYNCING(forknum)); > + flags &= ~SR_SYNCING(forknum); > if (flags & SR_JUST_DIRTIED(forknum)) > { >/* > * Someone else dirtied it while we were syncing, so we can't mark > * it clean. Let's give up on this SR and go around again. > */ >smgr_unlock_sr(sr, flags); >LWLockRelease(mapping_lock); >goto retry; > } > /* This fork is clean! */ > - flags &= ~SR_SYNCING(forknum); > flags &= ~SR_DIRTY(forknum); > } > > In smgr_drop_sr, if the sr is SR_SYNCING, it will retry until the sr is not > SR_SYNCING. But in smgr_alloc_sr, if the sr is SR_JUST_DIRTIED, it will > retry to get another sr, although it has been synced by smgrimmedsync, the > flag SR_SYNCING doesn't changed. This might cause dead lock. So I changed > your patch as above. Right. Thanks! I also added smgrdropdb() to handle DROP DATABASE (the problem you reported in your previous email). While tweaking that code, I fixed it so that it uses a condition variable to wait (instead of the silly sleep loop) when it needs to drop an SR that is being sync'd. Also, it now releases the mapping lock while it's doing that, and requires it on retry. > <2> I changed the code in smgr_drop_sr to avoid some corner cases > /* Mark it invalid and drop the mapping. */ > smgr_unlock_sr(sr, ~SR_VALID); > + for (forknum = 0; forknum <= MAX_FORKNUM; forknum++) > + sr->nblocks[forknum] = InvalidBlockNumber; > hash_search_with_hash_value(sr_mapping_table, >&reln->smgr_rnode, >hash, >HASH_REMOVE, >NULL); > > smgr_drop_sr just removes the hash entry from sr_mapping_table, but doesn't > remove the sr_pool. But in smgrnblocks_fast, it just get sr from sr_pool, so > I add some codes as above to avoid some corner cases get an unexpected result > from smgrnblocks_fast. Is it necessary, I also want some advice from you. Hmm. I think it might be better to increment sr->generation. That was already done in the "eviction" code path, where other processes might still have references to the SR object, and I don't think it's possible for anyone to access a dropped relation, but I suppose it's more consistent to do that anyway. Fixed. Thanks for the review! From 1d49405b41b3cdd6baf6f29219f18c6adefb5547 Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Fri, 13 Nov 2020 14:38:41 +1300 Subject: [PATCH v3 1/2] WIP: Track relation sizes in shared memory. MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Introduce a fixed size pool of SMgrSharedRelation objects. A new GUC smgr_shared_relation controls how many can exist at once, and they are evicted as required. "Dirty" SMgrSharedRelations can only be evicted after being synced to disk. Goals: 1. Provide faster lookups of relation sizes, cutting down on lseek() calls. This supercedes the recovery-only caching added recently, and replaces preexisting FSM and VM caching schemes. 2. Stop trusting the operating system to keep track of the size of files that we have recently extended, until fsync() has been called. XXX smgrimmedsync() is maybe too blunt an instrument? XXX perhaps mdsyncfiletag should tell smgr.c when it cleans forks via some new interface, but it doesn't actually know if it's cleaning a fork that extended a relation... XXX perhaps bgwriter should try to clean them? 
XXX currently reusing the array of locks also used for buffer mapping, need to define some more in lwlocks.c... Discussion: https://postgr.es/m/CAEepm%3D3SSw-Ty1DFcK%3D1rU-K6GSzYzfdD4d%2BZwapdN7dTa6%3DnQ%40mail.gmail.com Reviewed-by: 陈佳昕(步真) Reviewed-by: Andy Fan --- contrib/pg_visibility/pg_visibility.c | 1 - src/backend/access/heap/visibilitymap.c | 28 +- src/backend/catalog/storage.c | 2 - src/backend/commands/dbcommands.c | 3 + src/backend/postmaster/pgstat.c | 3 + src/backend/storage/freespace/freespace.c | 35 +- src/backend/storage/ipc/ipci.c| 3 + src/backend/storage/smgr/md.c | 10 + src/backend/storage/smgr/smgr.c | 540 -- src/backend/utils/misc/guc.c | 11 + src/include/pgstat.h | 1 + src/include/storage/smgr.h| 18 +- 12 files changed, 566 insertions(+), 89 deletions(-) diff --git a/contrib/pg_visibility/pg_visibility.c b/contrib/pg_visibility/pg_visibility.c index 54e47b810f..702951e487 1
Re: Re: Re: Cache relation sizes?
On Wed, Dec 30, 2020 at 5:52 PM Thomas Munro wrote: > and requires it on retry s/requires/reacquires/
Re: pgbench: option delaying queries till connections establishment?
On Sun, Nov 15, 2020 at 4:53 AM Fabien COELHO wrote: > In the attached version, I just comment out the call and add an > explanation about why it is commented out. If pg overall version > requirements are changed on windows, then it could be reinstated. It looks like macOS doesn't have pthread barriers (via cfbot 2021, now with more operating systems): pgbench.c:326:8: error: unknown type name 'pthread_barrier_t' static pthread_barrier_t barrier; ^ pgbench.c:6128:2: error: implicit declaration of function 'pthread_barrier_init' is invalid in C99 [-Werror,-Wimplicit-function-declaration] pthread_barrier_init(&barrier, NULL, nthreads); ^
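A possible fallback, since the barrier is only needed once at startup, is a single-use emulation on top of a mutex and condition variable, roughly like this (untested sketch, names made up; requires <pthread.h>, which pgbench already uses):

/* Single-use barrier emulation for systems without pthread_barrier_t. */
typedef struct
{
    pthread_mutex_t mutex;
    pthread_cond_t  cond;
    int             remaining;
} pgbench_barrier;

static int
pgbench_barrier_init(pgbench_barrier *b, int nthreads)
{
    b->remaining = nthreads;
    if (pthread_mutex_init(&b->mutex, NULL) != 0)
        return -1;
    return pthread_cond_init(&b->cond, NULL) != 0 ? -1 : 0;
}

static void
pgbench_barrier_wait(pgbench_barrier *b)
{
    pthread_mutex_lock(&b->mutex);
    if (--b->remaining == 0)
        pthread_cond_broadcast(&b->cond);   /* last thread releases everyone */
    else
    {
        while (b->remaining > 0)
            pthread_cond_wait(&b->cond, &b->mutex);
    }
    pthread_mutex_unlock(&b->mutex);
}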
Re: new heapcheck contrib module
On Tue, Oct 27, 2020 at 5:12 AM Mark Dilger wrote: > The v20 patches 0002, 0003, and 0005 still apply cleanly, but 0004 required a > rebase. (0001 was already committed last week.) > > Here is a rebased set of 4 patches, numbered 0002..0005 to be consistent with > the previous naming. There are no substantial changes. Hi Mark, The command line stuff fails to build on Windows[1]. I think it's just missing #include "getopt_long.h" (see contrib/vacuumlo/vacuumlo.c). [1] https://ci.appveyor.com/project/postgresql-cfbot/postgresql/build/1.0.123328
Re: WIP: BRIN multi-range indexes
On Sun, Dec 20, 2020 at 1:16 PM Tomas Vondra wrote: > Attached is an updated version of the patch series, rebased on current > master, and results for benchmark comparing the various bloom variants. Perhaps src/include/utils/inet.h needs to include <sys/socket.h>, because FreeBSD says: brin_minmax_multi.c:1693:24: error: use of undeclared identifier 'AF_INET' if (ip_family(ipa) == PGSQL_AF_INET) ^ ../../../../src/include/utils/inet.h:39:24: note: expanded from macro 'PGSQL_AF_INET' #define PGSQL_AF_INET (AF_INET + 0) ^
Re: psql \df choose functions by their arguments
On Thu, Dec 31, 2020 at 7:01 AM Greg Sabino Mullane wrote: > Attached is the latest patch against HEAD - basically fixes a few typos. Hi Greg, It looks like there is a collation dependency here that causes the test to fail on some systems: === ./src/test/regress/regression.diffs === diff -U3 /tmp/cirrus-ci-build/src/test/regress/expected/psql.out /tmp/cirrus-ci-build/src/test/regress/results/psql.out --- /tmp/cirrus-ci-build/src/test/regress/expected/psql.out 2021-01-01 16:05:25.749692000 + +++ /tmp/cirrus-ci-build/src/test/regress/results/psql.out 2021-01-01 16:11:28.525632000 + @@ -5094,8 +5094,8 @@ public | mtest | integer | double precision, double precision, integer | func public | mtest | integer | integer | func public | mtest | integer | integer, text | func - public | mtest | integer | timestamp without time zone, timestamp with time zone | func public | mtest | integer | time without time zone, time with time zone | func + public | mtest | integer | timestamp without time zone, timestamp with time zone | func
Re: Context diffs
On Tue, Jan 5, 2021 at 8:07 AM Bruce Momjian wrote: > * "git apply" and "git am" can't process context diffs (they throw an > error once a context-like section of the diff is hit; simple > adding/removing lines in a block works) > > * the commit-fest doesn't recognize context diff attachments as > patches: > > https://commitfest.postgresql.org/31/2912/ > > * cfbot can't process file renames/adds/deletes from context diffs For the record, cfbot just uses plain old GNU patch, because that seems to accept nearly everything that anyone posts here (after a step that tries to unpack tarballs etc.). Several people have suggested I change it to use git apply instead (IIRC it works better for patches containing binary files such as cryptographic keys?), but then it wouldn't accept ye olde context diffs.
Re: Handing off SLRU fsyncs to the checkpointer
On Mon, Jan 4, 2021 at 3:35 AM Tomas Vondra wrote: > Seems this commit left behind a couple unnecessary prototypes in a bunch > of header files. In particular, it removed these functions > > - ShutdownCLOG(); > - ShutdownCommitTs(); > - ShutdownSUBTRANS(); > - ShutdownMultiXact(); Thanks. Fixed.