Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes

2022-05-13 Thomas Munro
On Sat, May 14, 2022 at 3:33 AM Robert Haas wrote: > This seems fine, but I think you should add a non-trivial comment about it. Thanks for looking. Done, and pushed. Let's see if 180s per query is enough...

Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes

2022-05-13 Robert Haas
On Thu, May 12, 2022 at 10:20 PM Thomas Munro wrote: > As for skink failing, the timeout was hard coded 300s for the whole > test, but apparently that wasn't enough under valgrind. Let's use the > standard PostgreSQL::Test::Utils::timeout_default (180s usually), but > reset it for each query we s

Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes

2022-05-12 Thomas Munro
On Thu, May 12, 2022 at 4:57 PM Thomas Munro wrote: > On Thu, May 12, 2022 at 3:13 PM Thomas Munro wrote: > > error running SQL: 'psql::1: ERROR: source database > > "conflict_db_template" is being accessed by other users > > DETAIL: There is 1 other session using the database.' > > Oh, for thi

Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes

2022-05-11 Thomas Munro
On Thu, May 12, 2022 at 3:13 PM Thomas Munro wrote: > Chipmunk, another little early model Raspberry Pi: > > error running SQL: 'psql::1: ERROR: source database > "conflict_db_template" is being accessed by other users > DETAIL: There is 1 other session using the database.' Oh, for this one I t

Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes

2022-05-11 Thomas Munro
On Sat, May 7, 2022 at 9:37 PM Thomas Munro wrote: > So far "grison" failed. I think it's probably just that the test > forgot to wait for replay of CREATE EXTENSION before using pg_prewarm > on the standby, hence "ERROR: function pg_prewarm(oid) does not exist > at character 12". I'll wait for

Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes

2022-05-10 Thomas Munro
On Tue, May 10, 2022 at 1:07 AM Robert Haas wrote: > On Sun, May 8, 2022 at 7:30 PM Thomas Munro wrote: > > LOG: still waiting for pid 1651417 to accept ProcSignalBarrier > > STATEMENT: alter database mydb set tablespace ts1; > This is a very good idea. OK, I pushed this, after making the ere

Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes

2022-05-09 Robert Haas
On Sun, May 8, 2022 at 7:30 PM Thomas Munro wrote: > Simple idea: how about logging the PID of processes that block > progress for too long? In the attached, I arbitrarily picked 5 > seconds as the wait time between LOG messages. Also, DEBUG1 messages > let you see the processing speed on eg bui

Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes

2022-05-08 Thomas Munro
On Sat, May 7, 2022 at 4:52 PM Thomas Munro wrote: > I think we'll probably also want to invent a way > to report which backend is holding up progress, since without that > it's practically impossible for an end user to understand why their > command is hanging. Simple idea: how about logging the
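
The rate-limited reporting proposed above can be sketched as a small, self-contained simulation. This is hypothetical code, not the committed patch: `WaitLog`, `wait_log_due()` and `simulate_wait()` are invented names, time is passed in as plain milliseconds instead of being read from a clock, and the 5-second interval is the arbitrary value mentioned in the proposal. Only the message text mirrors the LOG line quoted elsewhere in the thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Interval between "still waiting" LOG lines (5 s, as proposed). */
#define WAIT_LOG_INTERVAL_MS 5000L

typedef struct {
    long start_ms;     /* when the wait began */
    long last_log_ms;  /* when we last logged, or -1 if never */
} WaitLog;

/* Returns true, and records the time, if a LOG line is due at now_ms. */
static bool wait_log_due(WaitLog *wl, long now_ms)
{
    long ref = (wl->last_log_ms < 0) ? wl->start_ms : wl->last_log_ms;

    if (now_ms - ref >= WAIT_LOG_INTERVAL_MS) {
        wl->last_log_ms = now_ms;
        return true;
    }
    return false;
}

/* Hypothetical simulation: wait for `pid`, which accepts the barrier after
 * `absorb_ms`, polling every `poll_ms`. Returns the number of LOG lines. */
static int simulate_wait(int pid, long absorb_ms, long poll_ms)
{
    WaitLog wl = { .start_ms = 0, .last_log_ms = -1 };
    int lines = 0;

    assert(poll_ms > 0);
    for (long now = 0; now < absorb_ms; now += poll_ms) {
        if (wait_log_due(&wl, now)) {
            printf("LOG: still waiting for pid %d to accept ProcSignalBarrier\n", pid);
            lines++;
        }
    }
    return lines;
}
```

For a 12-second wait polled every 100 ms, `simulate_wait(1651417, 12000, 100)` emits two lines (at 5 s and 10 s), so the blocking pid is named without flooding the log.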

Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes

2022-05-07 Thomas Munro
On Sat, May 7, 2022 at 4:52 PM Thomas Munro wrote: > Done. Time to watch the build farm. So far "grison" failed. I think it's probably just that the test forgot to wait for replay of CREATE EXTENSION before using pg_prewarm on the standby, hence "ERROR: function pg_prewarm(oid) does not exist

Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes

2022-05-06 Thomas Munro
On Wed, May 4, 2022 at 2:23 PM Thomas Munro wrote: > Assuming no > objections or CI failures show up, I'll consider pushing the first two > patches tomorrow. Done. Time to watch the build farm. It's possible that these changes will produce some blowback, now that we're using PROCSIGNAL_BARRIER_
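
The barrier mechanism these changes rely on can be modeled loosely as a generation counter: the emitter bumps a shared generation, each process records the last generation it has absorbed (closing its cached files as it does so), and the emitter waits until no process lags behind. The sketch below is a single-threaded, hypothetical model, not PostgreSQL's data structures: `ProcSlot`, `emit_barrier()`, `absorb_barrier()` and `barrier_laggard()` are invented names, and the atomics and signalling that real shared-memory code needs are omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-process slot in a shared array. */
typedef struct {
    int pid;
    uint64_t absorbed_gen;   /* highest barrier generation processed */
    int open_files;          /* stand-in for cached smgr descriptors */
} ProcSlot;

static uint64_t barrier_gen = 0;   /* shared generation counter */

/* Emitter side: publish a new generation and return it as the wait target. */
static uint64_t emit_barrier(void)
{
    return ++barrier_gen;
}

/* Each process, at a safe point, closes its files and acknowledges. */
static void absorb_barrier(ProcSlot *p)
{
    assert(p != NULL);
    p->open_files = 0;            /* e.g. close every cached descriptor */
    p->absorbed_gen = barrier_gen;
}

/* Returns the pid of a process that has not yet absorbed `target`,
 * or 0 once every process has caught up. */
static int barrier_laggard(const ProcSlot *procs, size_t n, uint64_t target)
{
    for (size_t i = 0; i < n; i++)
        if (procs[i].absorbed_gen < target)
            return procs[i].pid;
    return 0;
}
```

Returning the laggard's pid rather than a boolean is what makes a "still waiting for pid N" report possible while the emitter is blocked.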

Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes

2022-05-03 Thomas Munro
On Wed, May 4, 2022 at 8:53 AM Thomas Munro wrote: > Got some off-list clues: that's just distracting Perl cleanup noise > after something else went wrong (thanks Robert), and now I'm testing a > theory from Andres that we're missing a barrier on the redo side when > replaying XLOG_DBASE_CREATE_FI

Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes

2022-05-03 Thomas Munro
On Wed, May 4, 2022 at 7:44 AM Thomas Munro wrote: > It passes sometimes and fails sometimes. Here's the weird failure I > need to debug: > > https://api.cirrus-ci.com/v1/artifact/task/6033765456674816/log/src/test/recovery/tmp_check/log/regress_log_032_relfilenode_reuse > > Right at the end, it

Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes

2022-05-03 Thomas Munro
On Wed, May 4, 2022 at 6:36 AM Robert Haas wrote: > On Fri, Apr 22, 2022 at 3:38 AM Thomas Munro wrote: > > So, to summarise the new patch that I'm attaching to this email as 0001: > > This all makes sense to me, and I didn't see anything obviously wrong > looking through the patch, either. Than

Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes

2022-05-03 Robert Haas
On Fri, Apr 22, 2022 at 3:38 AM Thomas Munro wrote: > So, to summarise the new patch that I'm attaching to this email as 0001: This all makes sense to me, and I didn't see anything obviously wrong looking through the patch, either. > However it seems that I have something wrong, because CI is fa

Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes

2022-04-22 Thomas Munro
On Wed, Apr 6, 2022 at 5:07 AM Robert Haas wrote: > On Mon, Apr 4, 2022 at 10:20 PM Thomas Munro wrote: > > > The checkpointer never takes heavyweight locks, so the opportunity > > > you're describing can't arise. > > > > Hmm, oh, you probably meant the buffer interlocking > > in SyncOneBuffer(

Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes

2022-04-05 Robert Haas
On Mon, Apr 4, 2022 at 10:20 PM Thomas Munro wrote: > > The checkpointer never takes heavyweight locks, so the opportunity > > you're describing can't arise. > > Hmm, oh, you probably meant the buffer interlocking > in SyncOneBuffer(). It's true that my most recent patch throws away > more requ

Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes

2022-04-04 Thomas Munro
On Tue, Apr 5, 2022 at 10:24 AM Thomas Munro wrote: > On Tue, Apr 5, 2022 at 2:18 AM Robert Haas wrote: > > I'm not sure that it really matters, but with the idea that I proposed > > it's possible to "save" a pending writeback, if we notice that we're > > accessing the relation with a proper lock

Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes

2022-04-04 Thomas Munro
On Tue, Apr 5, 2022 at 2:18 AM Robert Haas wrote: > On Fri, Apr 1, 2022 at 5:03 PM Thomas Munro wrote: > > Another idea would be to call a new function DropPendingWritebacks(), > > and also tell all the SMgrRelation objects to close all their internal > > state (ie the fds + per-segment objects)

Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes

2022-04-04 Robert Haas
On Fri, Apr 1, 2022 at 5:03 PM Thomas Munro wrote: > Another idea would be to call a new function DropPendingWritebacks(), > and also tell all the SMgrRelation objects to close all their internal > state (ie the fds + per-segment objects) but not free the main > SMgrRelationData object, and for go

Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes

2022-04-01 Thomas Munro
On Sat, Apr 2, 2022 at 10:03 AM Thomas Munro wrote: > Another idea would be to call a new function DropPendingWritebacks(), > and also tell all the SMgrRelation objects to close all their internal > state (ie the fds + per-segment objects) but not free the main > SMgrRelationData object, and for g
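
The idea of closing an SMgrRelation's internal state while keeping the object itself alive can be illustrated with a hypothetical handle that caches one descriptor and reopens it lazily. `RelHandle`, `rel_handle_init()`, `rel_handle_fd()` and `rel_handle_close_internal()` are invented for this sketch, which assumes a single file per relation (real relations can have several segments). Because the path is resolved again on the next use, a file recreated at the same path after the barrier is opened fresh instead of being reached through a stale descriptor.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical per-relation handle: the object outlives its descriptor. */
typedef struct {
    char path[256];   /* one file per relation in this sketch */
    int fd;           /* -1 while closed */
} RelHandle;

static void rel_handle_init(RelHandle *h, const char *path)
{
    int n = snprintf(h->path, sizeof h->path, "%s", path);
    assert(n > 0 && (size_t) n < sizeof h->path);
    h->fd = -1;
}

/* Open lazily: the pathname is resolved again on first use after a close. */
static int rel_handle_fd(RelHandle *h)
{
    if (h->fd < 0)
        h->fd = open(h->path, O_RDWR);
    return h->fd;
}

/* Barrier-time cleanup: drop the descriptor but keep the handle valid,
 * so callers holding a pointer to it never see freed memory. */
static void rel_handle_close_internal(RelHandle *h)
{
    if (h->fd >= 0) {
        close(h->fd);
        h->fd = -1;
    }
}
```

Had the handle kept its descriptor across an unlink and re-create, `fstat()` on it would report `st_nlink == 0`; after the close-and-reopen it reports the live file.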

Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes

2022-04-01 Thomas Munro
On Sat, Apr 2, 2022 at 2:52 AM Robert Haas wrote: > On Fri, Apr 1, 2022 at 2:04 AM Thomas Munro wrote: > > The v1-0003 patch introduced smgropen_cond() to avoid the problem of > > IssuePendingWritebacks(), which does desynchronised smgropen() calls > > and could open files after the barrier but j

Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes

2022-04-01 Robert Haas
On Fri, Apr 1, 2022 at 2:04 AM Thomas Munro wrote: > The v1-0003 patch introduced smgropen_cond() to avoid the problem of > IssuePendingWritebacks(), which does desynchronised smgropen() calls > and could open files after the barrier but just before they are > unlinked. Makes sense, but... > > 1.

Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes

2022-03-31 Thomas Munro
Some thoughts: The v1-0003 patch introduced smgropen_cond() to avoid the problem of IssuePendingWritebacks(), which does desynchronised smgropen() calls and could open files after the barrier but just before they are unlinked. Makes sense, but... 1. For that to actually work, we'd better call s
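
A conditional open in the spirit of `smgropen_cond()` can be sketched as a lookup that never creates an entry, so deferred writeback requests for relations released by a barrier are dropped instead of reopening their files. Everything here (`RelEntry`, `rel_open()`, `rel_open_cond()`, `rel_release_all()`, `issue_pending_writebacks()`) is a hypothetical stand-in, using a fixed-size array instead of a hash table and a counter instead of real descriptors; it is not the v1-0003 patch. It assumes a skipped writeback request is only a lost flush hint, not lost data.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_RELS 16

typedef struct {
    bool in_use;
    unsigned relnode;   /* hypothetical stand-in for a relfilenode */
    int nopen_segs;     /* stand-in for the descriptors an entry holds */
} RelEntry;

static RelEntry rel_cache[MAX_RELS];

static RelEntry *rel_lookup(unsigned relnode)
{
    for (size_t i = 0; i < MAX_RELS; i++)
        if (rel_cache[i].in_use && rel_cache[i].relnode == relnode)
            return &rel_cache[i];
    return NULL;
}

/* Unconditional open: find or create an entry. */
static RelEntry *rel_open(unsigned relnode)
{
    RelEntry *e = rel_lookup(relnode);

    if (e != NULL)
        return e;
    for (size_t i = 0; i < MAX_RELS; i++)
        if (!rel_cache[i].in_use) {
            rel_cache[i] = (RelEntry){ .in_use = true, .relnode = relnode, .nopen_segs = 1 };
            return &rel_cache[i];
        }
    return NULL;   /* cache full */
}

/* Conditional open: never creates, so it cannot open a file that a
 * barrier has just told this process to forget. */
static RelEntry *rel_open_cond(unsigned relnode)
{
    return rel_lookup(relnode);
}

/* What a barrier handler would do: forget every entry and its files. */
static void rel_release_all(void)
{
    for (size_t i = 0; i < MAX_RELS; i++)
        rel_cache[i].in_use = false;
}

/* Issue deferred writebacks, skipping relations that are no longer open.
 * Returns how many requests were issued. */
static int issue_pending_writebacks(const unsigned *pending, size_t n)
{
    int issued = 0;

    for (size_t i = 0; i < n; i++) {
        RelEntry *e = rel_open_cond(pending[i]);
        if (e == NULL)
            continue;   /* released by a barrier: drop the hint */
        issued++;       /* real code would flush the block range here */
    }
    return issued;
}
```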

Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes

2022-03-03 Robert Haas
On Thu, Mar 3, 2022 at 1:28 PM Andres Freund wrote: > > I can't remember that verify() is the one that accesses conflict.db large > > while cause_eviction() is the one that accesses postgres.replace_sb for more > > than like 15 seconds. > > For more than 15seconds? The whole test runs in a few sec

Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes

2022-03-03 Andres Freund
Hi, On 2022-03-03 13:11:17 -0500, Robert Haas wrote: > On Wed, Mar 2, 2022 at 3:00 PM Andres Freund wrote: > > On 2022-03-02 14:52:01 -0500, Robert Haas wrote: > > > - I am having some trouble understanding clearly what 0001 is doing. > > > I'll try to study it further. > > > > It tests for the v

Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes

2022-03-03 Robert Haas
On Wed, Mar 2, 2022 at 3:00 PM Andres Freund wrote: > On 2022-03-02 14:52:01 -0500, Robert Haas wrote: > > - I am having some trouble understanding clearly what 0001 is doing. > > I'll try to study it further. > > It tests for the various scenarios I could think of that could lead to FD > reuse, t

Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes

2022-03-02 Robert Haas
On Wed, Mar 2, 2022 at 3:00 PM Andres Freund wrote: > What I am stuck on is what we can do for the released branches. Data > corruption after two consecutive ALTER DATABASE SET TABLESPACEs seems like > something we need to address. I think we should consider back-porting the ProcSignalBarrier stu

Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes

2022-03-02 Andres Freund
Hi, On 2022-03-02 14:52:01 -0500, Robert Haas wrote: > - I am having some trouble understanding clearly what 0001 is doing. > I'll try to study it further. It tests for the various scenarios I could think of that could lead to FD reuse, to state the obvious ;). Anything particularly unclear. >

Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes

2022-03-02 Robert Haas
On Tue, Feb 22, 2022 at 4:40 AM Andres Freund wrote: > On 2022-02-22 01:11:21 -0800, Andres Freund wrote: > > I've started to work on a few debugging aids to find problem like > > these. Attached are two WIP patches: > > Forgot to attach. Also importantly includes a tap test for several of these >

Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes

2022-02-22 Andres Freund
Hi, On 2022-02-22 01:11:21 -0800, Andres Freund wrote: > I've started to work on a few debugging aids to find problem like > these. Attached are two WIP patches: Forgot to attach. Also importantly includes a tap test for several of these issues Greetings, Andres Freund >From 0bc64874f8e5faae9a3

Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes

2022-02-22 Andres Freund
Hi, On 2022-02-10 14:26:59 -0800, Andres Freund wrote: > On 2022-02-11 09:10:38 +1300, Thomas Munro wrote: > > It seems like I should go ahead and do that today, and we can study > > further uses for PROCSIGNAL_BARRIER_SMGRRELEASE in follow-on work? > > Yes. I wrote a test to show the problem. W

Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes

2022-02-10 Andres Freund
Hi, On 2022-02-11 09:10:38 +1300, Thomas Munro wrote: > I was about to commit that, because the original Windows problem it > solved is showing up occasionally in CI failures (that is, it already > solves a live problem, albeit a different and non-data-corrupting > one): +1 > It seems like I sho

Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes

2022-02-10 Andres Freund
Hi, On 2022-02-10 13:49:50 -0500, Robert Haas wrote: > I agree. While I feel sort of bad about missing this issue in review, > I also feel like it's pretty surprising that there isn't something > plugging this hole already. It feels unexpected that our FD management > layer might hand you an FD th

Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes

2022-02-10 Robert Haas
On Thu, Feb 10, 2022 at 3:11 PM Thomas Munro wrote: > On Fri, Feb 11, 2022 at 7:50 AM Robert Haas wrote: > > The main question in my mind is who is going to actually make that > > happen. It was your idea (I think), Thomas coded it, and my commit > > made it a live problem. So who's going to get

Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes

2022-02-10 Thomas Munro
On Fri, Feb 11, 2022 at 7:50 AM Robert Haas wrote: > The main question in my mind is who is going to actually make that > happen. It was your idea (I think), Thomas coded it, and my commit > made it a live problem. So who's going to get something committed > here? I was about to commit that, beca

Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes

2022-02-10 Robert Haas
On Wed, Feb 9, 2022 at 5:00 PM Andres Freund wrote: > The problem starts with > > commit aa01051418f10afbdfa781b8dc109615ca785ff9 > Author: Robert Haas > Date: 2022-01-24 14:23:15 -0500 > > pg_upgrade: Preserve database OIDs. Well, that's sad. > I think the most realistic way to address t

Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes

2022-02-09 Justin Pryzby
On Wed, Feb 09, 2022 at 02:00:04PM -0800, Andres Freund wrote: > On linux we can do so by a) checking if readlink(/proc/self/fd/$fd) points to > a filename ending in " (deleted)", b) doing fstat(fd) and checking if st_nlink > == 0. You could also stat() the file in /proc/self/fd/N and compare st_in
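
Both checks from the quoted message, plus a descriptor-versus-pathname inode comparison, can be sketched in C. This is a hypothetical debugging aid, not PostgreSQL code: the helper names are invented, `fd_proc_link_deleted()` only works on Linux with /proc mounted, and `fd_matches_path()` is one possible reading of the st_ino idea that compares the descriptor's device and inode with whatever the pathname currently resolves to.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* (b) 1 if the file behind fd has no remaining links, 0 if linked, -1 on error. */
static int fd_nlink_zero(int fd)
{
    struct stat st;

    if (fstat(fd, &st) != 0)
        return -1;
    return st.st_nlink == 0;
}

/* (a) Linux only: does the /proc/self/fd/N link text end in " (deleted)"? */
static int fd_proc_link_deleted(int fd)
{
    static const char suffix[] = " (deleted)";
    const size_t slen = sizeof suffix - 1;
    char link_path[64];
    char target[4096];

    snprintf(link_path, sizeof link_path, "/proc/self/fd/%d", fd);
    ssize_t n = readlink(link_path, target, sizeof target - 1);
    if (n < 0)
        return -1;
    target[n] = '\0';   /* readlink() does not NUL-terminate */
    return (size_t) n >= slen && strcmp(target + n - slen, suffix) == 0;
}

/* Variant: does `path` still name the inode that fd refers to?
 * 1 = same file, 0 = different file or path gone, -1 = fstat failed. */
static int fd_matches_path(int fd, const char *path)
{
    struct stat fd_st, path_st;

    if (fstat(fd, &fd_st) != 0)
        return -1;
    if (stat(path, &path_st) != 0)
        return 0;
    return fd_st.st_dev == path_st.st_dev && fd_st.st_ino == path_st.st_ino;
}
```

The `st_nlink` check is portable but only notices that the file was unlinked; the pathname comparison also notices when a different file has since been created at the same path.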

wrong fds used for refilenodes after pg_upgrade relfilenode changes

2022-02-09 Andres Freund
Hi, I was working on rebasing the AIO branch. Tests started to fail after, but it turns out that the problem exists independent of AIO. The problem starts with commit aa01051418f10afbdfa781b8dc109615ca785ff9 Author: Robert Haas Date: 2022-01-24 14:23:15 -0500 pg_upgrade: Preserve databas
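
The underlying failure class can be reproduced outside PostgreSQL with plain POSIX calls: a process caches a descriptor, the file is unlinked, a new file is created at the same pathname, and a write through the old descriptor lands in the unlinked inode instead of the new file. The sketch below is a hypothetical illustration; `write_via_stale_fd()` and the /tmp path are invented, and it models only the descriptor reuse, not the pg_upgrade or tablespace steps that make relation paths repeat.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

#define SKETCH_PAGE_SIZE 8192   /* stand-in for one database page */

/* Hypothetical: returns the size of the file now at the reused path after
 * one page was written through a descriptor opened before the reuse. */
static long write_via_stale_fd(void)
{
    char path[] = "/tmp/relfile_sketch_XXXXXX";
    int old_fd = mkstemp(path);          /* a process opens and caches the file */
    assert(old_fd >= 0);

    int rc = unlink(path);               /* the file is removed from its path... */
    assert(rc == 0);

    int new_fd = open(path, O_RDWR | O_CREAT | O_EXCL, 0600);
    assert(new_fd >= 0);                 /* ...and a new file reuses the path */

    char page[SKETCH_PAGE_SIZE];
    memset(page, 'x', sizeof page);
    ssize_t written = write(old_fd, page, sizeof page);   /* stale descriptor */
    assert(written == (ssize_t) sizeof page);

    struct stat st;
    rc = fstat(new_fd, &st);
    assert(rc == 0);

    close(old_fd);                       /* last reference: the page is discarded */
    close(new_fd);
    unlink(path);
    return (long) st.st_size;
}
```

`write_via_stale_fd()` returns 0: the write succeeds without error, yet the file that now lives at the path never sees the page, which is why this kind of bug shows up as silent data loss rather than an I/O error.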