Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-16 Thread Thomas Munro
On Wed, Jun 17, 2015 at 6:58 AM, Alvaro Herrera wrote: > Thomas Munro wrote: > >> Thanks. As mentioned elsewhere in the thread, I discovered that the >> same problem exists for page boundaries, with a different error >> message. I've tried the attached repro scripts on 9.3.0, 9.3.5, 9.4.1 >> an

Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-16 Thread Alvaro Herrera
Thomas Munro wrote: > Thanks. As mentioned elsewhere in the thread, I discovered that the > same problem exists for page boundaries, with a different error > message. I've tried the attached repro scripts on 9.3.0, 9.3.5, 9.4.1 > and master with the same results: > > FATAL: could not access s

Re: [HACKERS] [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-16 Thread Alvaro Herrera
Alvaro Herrera wrote: > I see another hole in this area. See do_start_worker() -- there we only > consider the offsets limit to determine a database to be in > almost-wrapped-around state (causing emergency attention). If the > database in members trouble has no pgstat entry, it might get comple

Re: [HACKERS] [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-15 Thread Robert Haas
On Fri, Jun 12, 2015 at 7:27 PM, Steve Kehlet wrote: > Just wanted to report that I rolled back my VM to where it was with 9.4.2 > installed and it wouldn't start. I installed 9.4.4 and now it starts up just > fine: > >> 2015-06-12 16:05:58 PDT [6453]: [1-1] LOG: database system was shut down >>

Re: [HACKERS] [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-15 Thread Alvaro Herrera
Andres Freund wrote: > A first version to address this problem can be found appended to this > email. > > Basically it does: > * Whenever more than MULTIXACT_MEMBER_SAFE_THRESHOLD are used, signal > autovacuum once per members segment > * For both members and offsets, once hitting the hard limi

Re: [HACKERS] [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-12 Thread Steve Kehlet
Just wanted to report that I rolled back my VM to where it was with 9.4.2 installed and it wouldn't start. I installed 9.4.4 and now it starts up just fine: > 2015-06-12 16:05:58 PDT [6453]: [1-1] LOG: database system was shut down at 2015-05-27 13:12:55 PDT > 2015-06-12 16:05:58 PDT [6453]: [2-1

Re: [HACKERS] [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-11 Thread Jeff Janes
On Wed, Jun 10, 2015 at 7:16 PM, Noah Misch wrote: > On Mon, Jun 08, 2015 at 03:15:04PM +0200, Andres Freund wrote: > > One more thing: > > Our testing infrastructure sucks. Without writing C code it's basically > > impossible to test wraparounds and such. Even if not particularly useful > > for

Re: [HACKERS] [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-10 Thread Noah Misch
On Mon, Jun 08, 2015 at 03:15:04PM +0200, Andres Freund wrote: > One more thing: > Our testing infrastructure sucks. Without writing C code it's basically > impossible to test wraparounds and such. Even if not particularly useful > for non-devs, I really think we should have functions for creating

Re: [HACKERS] [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-08 Thread Alvaro Herrera
Robert Haas wrote: > On Mon, Jun 8, 2015 at 1:23 PM, Alvaro Herrera > wrote: > > (My personal alarm bells go off when I see autovac_naptime=15min or > > more, but apparently not everybody sees things that way.) > > Uh, I'd echo that sentiment if you did s/15min/1min/ Yeah, well, that too I gue

Re: [HACKERS] [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-08 Thread Andres Freund
On 2015-06-08 14:23:32 -0300, Alvaro Herrera wrote: > Sure. I just concern that we might be putting excessive trust on > emergency workers being launched at a high pace. I'm not sure what to do about that. I mean, it'd not be hard to simply ignore naptime upon wraparound, but I'm not sure that'd

Re: [HACKERS] [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-08 Thread Robert Haas
On Mon, Jun 8, 2015 at 1:23 PM, Alvaro Herrera wrote: > Andres Freund wrote: >> On June 8, 2015 7:06:31 PM GMT+02:00, Alvaro Herrera >> wrote: >> >I might be misreading the code, but PMSIGNAL_START_AUTOVAC_LAUNCHER >> >only causes things to happen (i.e. a new worker to be started) when >> >autov

Re: [HACKERS] [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-08 Thread Alvaro Herrera
Andres Freund wrote: > On June 8, 2015 7:06:31 PM GMT+02:00, Alvaro Herrera > wrote: > >I might be misreading the code, but PMSIGNAL_START_AUTOVAC_LAUNCHER > >only causes things to happen (i.e. a new worker to be started) when > >autovacuum is disabled. If autovacuum is enabled, postmaster > >r

Re: [HACKERS] [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-08 Thread Andres Freund
On June 8, 2015 7:06:31 PM GMT+02:00, Alvaro Herrera wrote: >Andres Freund wrote: > >> A first version to address this problem can be found appended to this >> email. >> >> Basically it does: >> * Whenever more than MULTIXACT_MEMBER_SAFE_THRESHOLD are used, signal >> autovacuum once per member

Re: [HACKERS] [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-08 Thread Alvaro Herrera
Andres Freund wrote: > A first version to address this problem can be found appended to this > email. > > Basically it does: > * Whenever more than MULTIXACT_MEMBER_SAFE_THRESHOLD are used, signal > autovacuum once per members segment > * For both members and offsets, once hitting the hard limi

Re: [HACKERS] [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-08 Thread Andres Freund
On 2015-06-08 15:15:04 +0200, Andres Freund wrote: > 1) the autovacuum trigger logic isn't perfect yet. I.e. especially with > autovacuum=off you can get into situations where emergency vacuums > aren't started when necessary. This is particularly likely to happen > if either very large multi

Re: [HACKERS] [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-08 Thread Andres Freund
On 2015-06-05 16:56:18 -0400, Tom Lane wrote: > Andres Freund writes: > > On June 5, 2015 10:02:37 PM GMT+02:00, Robert Haas > > wrote: > >> I think we would be foolish to rush that part into the tree. We > >> probably got here in the first place by rushing the last round of > >> fixes too much

Re: [HACKERS] [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-07 Thread Andres Freund
On 2015-06-05 20:47:33 +0200, Andres Freund wrote: > On 2015-06-05 14:33:12 -0400, Tom Lane wrote: > > Robert Haas writes: > > > 1. The problem that we might truncate an SLRU members page away when > > > it's in the buffers, but not drop it from the buffers, leading to a > > > failure when we try

Re: [HACKERS] [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-05 Thread Robert Haas
On Fri, Jun 5, 2015 at 4:40 PM, Andres Freund wrote: >>I think we would be foolish to rush that part into the tree. We >>probably got here in the first place by rushing the last round of >>fixes too much; let's try not to double down on that mistake. > > My problem with that approach is that I th

Re: [HACKERS] [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-05 Thread Alvaro Herrera
Joshua D. Drake wrote: > I believe there are likely quite a few parties willing to help test, if we > knew how? The code involved is related to checkpoints, pg_basebackups that take a long time to run, and multixact freezing and truncation. If you can set up test servers that eat lots of multixa

Re: [HACKERS] [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-05 Thread Joshua D. Drake
On 06/05/2015 01:56 PM, Tom Lane wrote: If we have confidence that we can ship something on Monday that is materially more trustworthy than the current releases, then let's aim to do that; but let's ship only patches we are confident in. We can do another set of releases later that incorporate

Re: [HACKERS] [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-05 Thread Tom Lane
Andres Freund writes: > On June 5, 2015 10:02:37 PM GMT+02:00, Robert Haas > wrote: >> I think we would be foolish to rush that part into the tree. We >> probably got here in the first place by rushing the last round of >> fixes too much; let's try not to double down on that mistake. > My prob

Re: [HACKERS] [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-05 Thread Andres Freund
On June 5, 2015 10:02:37 PM GMT+02:00, Robert Haas wrote: >On Fri, Jun 5, 2015 at 2:47 PM, Andres Freund >wrote: >> On 2015-06-05 14:33:12 -0400, Tom Lane wrote: >>> Robert Haas writes: >>> > 1. The problem that we might truncate an SLRU members page away >when >>> > it's in the buffers, but no

Re: [HACKERS] [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-05 Thread Robert Haas
On Fri, Jun 5, 2015 at 2:36 PM, Alvaro Herrera wrote: > Tom Lane wrote: >> Robert Haas writes: > >> > There are at least two other known issues that seem like they should >> > be fixed before we release: >> >> > 1. The problem that we might truncate an SLRU members page away when >> > it's in the

Re: [HACKERS] [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-05 Thread Robert Haas
On Fri, Jun 5, 2015 at 2:47 PM, Andres Freund wrote: > On 2015-06-05 14:33:12 -0400, Tom Lane wrote: >> Robert Haas writes: >> > 1. The problem that we might truncate an SLRU members page away when >> > it's in the buffers, but not drop it from the buffers, leading to a >> > failure when we try t

Re: [HACKERS] [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-05 Thread Steve Kehlet
On Fri, Jun 5, 2015 at 11:47 AM Andres Freund wrote: > But I'd definitely like some > independent testing for it, and I'm not sure if that's doable in time > for the wrap. > I'd be happy to test on my database that was broken, for however much that helps. It's a VM so I can easily revert back as

Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-05 Thread Alvaro Herrera
Robert Haas wrote: > On Fri, Jun 5, 2015 at 2:20 AM, Noah Misch wrote: > > On Thu, Jun 04, 2015 at 05:29:51PM -0400, Robert Haas wrote: > >> Here's a new version with some more fixes and improvements: > > > > I read through this version and found nothing to change. I encourage other > > hackers t

Re: [HACKERS] [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-05 Thread Andres Freund
On 2015-06-05 14:33:12 -0400, Tom Lane wrote: > Robert Haas writes: > > 1. The problem that we might truncate an SLRU members page away when > > it's in the buffers, but not drop it from the buffers, leading to a > > failure when we try to write it later. I've got a fix for this, and about three

Re: [HACKERS] [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-05 Thread Alvaro Herrera
Tom Lane wrote: > Robert Haas writes: > > There are at least two other known issues that seem like they should > > be fixed before we release: > > > 1. The problem that we might truncate an SLRU members page away when > > it's in the buffers, but not drop it from the buffers, leading to a > > fa

Re: [HACKERS] [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-05 Thread Tom Lane
Robert Haas writes: > On Fri, Jun 5, 2015 at 12:00 PM, Andres Freund wrote: >> On 2015-06-05 11:43:45 -0400, Tom Lane wrote: >>> So where are we on this? Are we ready to schedule a new set of >>> back-branch releases? If not, what issues remain to be looked at? >> We're currently still doing b

Re: [HACKERS] [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-05 Thread Robert Haas
On Fri, Jun 5, 2015 at 12:00 PM, Andres Freund wrote: > On 2015-06-05 11:43:45 -0400, Tom Lane wrote: >> Robert Haas writes: >> > On Fri, Jun 5, 2015 at 2:20 AM, Noah Misch wrote: >> >> I read through this version and found nothing to change. I encourage >> >> other >> >> hackers to study the

Re: [HACKERS] [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-05 Thread Andres Freund
On 2015-06-05 11:43:45 -0400, Tom Lane wrote: > Robert Haas writes: > > On Fri, Jun 5, 2015 at 2:20 AM, Noah Misch wrote: > >> I read through this version and found nothing to change. I encourage other > >> hackers to study the patch, though. The surrounding code is challenging. > > > Andres t

Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-05 Thread Tom Lane
Robert Haas writes: > On Fri, Jun 5, 2015 at 2:20 AM, Noah Misch wrote: >> I read through this version and found nothing to change. I encourage other >> hackers to study the patch, though. The surrounding code is challenging. > Andres tested this and discovered that my changes to > find_multix

Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-05 Thread Robert Haas
On Fri, Jun 5, 2015 at 2:20 AM, Noah Misch wrote: > On Thu, Jun 04, 2015 at 05:29:51PM -0400, Robert Haas wrote: >> Here's a new version with some more fixes and improvements: > > I read through this version and found nothing to change. I encourage other > hackers to study the patch, though. The

Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-05 Thread Thomas Munro
On Fri, Jun 5, 2015 at 1:47 PM, Thomas Munro wrote: > On Fri, Jun 5, 2015 at 11:47 AM, Thomas Munro > wrote: >> On Fri, Jun 5, 2015 at 9:29 AM, Robert Haas wrote: >>> Here's a new version with some more fixes and improvements: >>> [...] >> >> With this patch, when I run the script >> "checkpoint

Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-04 Thread Noah Misch
On Thu, Jun 04, 2015 at 05:29:51PM -0400, Robert Haas wrote: > Here's a new version with some more fixes and improvements: I read through this version and found nothing to change. I encourage other hackers to study the patch, though. The surrounding code is challenging. > With this version, I'm

Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-04 Thread Thomas Munro
On Fri, Jun 5, 2015 at 11:47 AM, Thomas Munro wrote: > On Fri, Jun 5, 2015 at 9:29 AM, Robert Haas wrote: >> Here's a new version with some more fixes and improvements: >> >> - SetOffsetVacuumLimit was failing to set MultiXactState->oldestOffset >> when the oldest offset became known if the now-k

Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-04 Thread Thomas Munro
On Fri, Jun 5, 2015 at 9:29 AM, Robert Haas wrote: > Here's a new version with some more fixes and improvements: > > - SetOffsetVacuumLimit was failing to set MultiXactState->oldestOffset > when the oldest offset became known if the now-known value happened to > be zero. Fixed. > > - SetOffsetVac

Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-04 Thread Robert Haas
On Thu, Jun 4, 2015 at 5:29 PM, Robert Haas wrote: > - Forces aggressive autovacuuming when the control file's > oldestMultiXid doesn't point to a valid MultiXact and enables member > wraparound at the next checkpoint following the correction of that > problem. Err, enables member wraparound *pro

Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-04 Thread Robert Haas
On Thu, Jun 4, 2015 at 12:57 PM, Robert Haas wrote: > On Thu, Jun 4, 2015 at 9:42 AM, Robert Haas wrote: >> Thanks for the review. > > Here's a new version. I've fixed the things Alvaro and Noah noted, > and some compiler warnings about set but unused variables. > > I also tested it, and it does

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-04 Thread Alvaro Herrera
Alvaro Herrera wrote: > Robert Haas wrote: > > > So here's a patch taking a different approach. > > I tried to apply this to 9.3 but it's messy because of pgindent. Anyone > would have a problem with me backpatching a pgindent run of multixact.c? Done. -- Álvaro Herrerahttp://

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-04 Thread Robert Haas
On Thu, Jun 4, 2015 at 1:27 PM, Andres Freund wrote: > On 2015-06-04 12:57:42 -0400, Robert Haas wrote: >> + /* >> + * Do we need an emergency autovacuum? If we're not sure, assume yes. >> + */ >> + return !oldestOffsetKnown || >> + (nextOffset - oldestOffset > MULTI

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-04 Thread Andres Freund
Hi, On 2015-06-04 12:57:42 -0400, Robert Haas wrote: > + /* > + * Do we need an emergency autovacuum? If we're not sure, assume yes. > + */ > + return !oldestOffsetKnown || > + (nextOffset - oldestOffset > MULTIXACT_MEMBER_SAFE_THRESHOLD); I think without teaching a

Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-04 Thread Robert Haas
On Thu, Jun 4, 2015 at 9:42 AM, Robert Haas wrote: > Thanks for the review. Here's a new version. I've fixed the things Alvaro and Noah noted, and some compiler warnings about set but unused variables. I also tested it, and it doesn't quite work as hoped. If started on a cluster where oldestMu

Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-04 Thread Robert Haas
On Thu, Jun 4, 2015 at 2:42 AM, Noah Misch wrote: > I like that change a lot. It's much easier to seek forgiveness for wasting <= > 28 GiB of disk than for deleting visibility information wrongly. I'm glad you like it. I concur. >> 2. If setting the offset stop limit (the point where we refuse

Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-03 Thread Noah Misch
On Wed, Jun 03, 2015 at 04:53:46PM -0400, Robert Haas wrote: > So here's a patch taking a different approach. In this approach, if > the multixact whose members we want to look up doesn't exist, we don't > use a later one (that might or might not be valid). Instead, we > attempt to cope with the

Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-03 Thread Thomas Munro
On Mon, Jun 1, 2015 at 4:55 PM, Noah Misch wrote: > While testing this (with inconsistent-multixact-fix-master.patch applied, > FWIW), I noticed a nearby bug with a similar symptom. TruncateMultiXact() > omits the nextMXact==oldestMXact special case found in each other > find_multixact_start() ca

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-03 Thread Alvaro Herrera
Robert Haas wrote: > So here's a patch taking a different approach. I tried to apply this to 9.3 but it's messy because of pgindent. Anyone would have a problem with me backpatching a pgindent run of multixact.c? Also, you have a new function SlruPageExists, but we already have SimpleLruDoesPhy

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-03 Thread Robert Haas
On Wed, Jun 3, 2015 at 8:24 AM, Robert Haas wrote: > On Tue, Jun 2, 2015 at 5:22 PM, Andres Freund wrote: >>> > Hm. If GetOldestMultiXactOnDisk() gets the starting point by scanning >>> > the disk it'll always get one at a segment boundary, right? I'm not sure >>> > that's actually ok; because th

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-03 Thread Alvaro Herrera
Andres Freund wrote: > On 2015-06-03 15:01:46 -0300, Alvaro Herrera wrote: > > One idea I had was: what if the oldestMulti pointed to another multi > > earlier in the same 0046 file, so that it is read-as-zeroes (and the > > file is created), and then a subsequent multixact truncate tries to read

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-03 Thread Andres Freund
On 2015-06-03 15:01:46 -0300, Alvaro Herrera wrote: > Andres Freund wrote: > > That's not necessarily the case though, given how the code currently > > works. In a bunch of places the SLRUs are accessed *before* having been > > made consistent by WAL replay. Especially if several checkpoints/vacuum

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-03 Thread Alvaro Herrera
Alvaro Herrera wrote: > Really, the whole question of how this code goes past the open() failure > in SlruPhysicalReadPage baffles me. I don't see any possible way for > the file to be created ... Hmm, the checkpointer can call TruncateMultiXact when in recovery, on restartpoints. I wonder if in

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-03 Thread Alvaro Herrera
Andres Freund wrote: > On 2015-06-03 00:42:55 -0300, Alvaro Herrera wrote: > > Thomas Munro wrote: > > > On Tue, Jun 2, 2015 at 9:30 AM, Alvaro Herrera > > > wrote: > > > > My guess is that the file existed, and perhaps had one or more pages, > > > > but the wanted page doesn't exist, so we tried

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-03 Thread Andres Freund
On 2015-06-03 00:42:55 -0300, Alvaro Herrera wrote: > Thomas Munro wrote: > > On Tue, Jun 2, 2015 at 9:30 AM, Alvaro Herrera > > wrote: > > > My guess is that the file existed, and perhaps had one or more pages, > > > but the wanted page doesn't exist, so we tried to read but got 0 bytes > > > ba

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-03 Thread Alvaro Herrera
Thomas Munro wrote: > I have finally reproduced that error! See attached repro shell script. > > The conditions are: > > 1. next multixact == oldest multixact (no active multixacts, pointing > past the end) > 2. next multixact would be the first item on a new page (multixact % 2048 == > 0) >

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-03 Thread Robert Haas
On Tue, Jun 2, 2015 at 5:22 PM, Andres Freund wrote: >> > Hm. If GetOldestMultiXactOnDisk() gets the starting point by scanning >> > the disk it'll always get one at a segment boundary, right? I'm not sure >> > that's actually ok; because the value at the beginning of the segment >> > can very wel

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-03 Thread Robert Haas
On Wed, Jun 3, 2015 at 4:48 AM, Thomas Munro wrote: > On Wed, Jun 3, 2015 at 3:42 PM, Alvaro Herrera > wrote: >> Thomas Munro wrote: >>> On Tue, Jun 2, 2015 at 9:30 AM, Alvaro Herrera >>> wrote: >>> > My guess is that the file existed, and perhaps had one or more pages, >>> > but the wanted pa

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-03 Thread Thomas Munro
On Wed, Jun 3, 2015 at 3:42 PM, Alvaro Herrera wrote: > Thomas Munro wrote: >> On Tue, Jun 2, 2015 at 9:30 AM, Alvaro Herrera >> wrote: >> > My guess is that the file existed, and perhaps had one or more pages, >> > but the wanted page doesn't exist, so we tried to read but got 0 bytes >> > back

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-02 Thread Alvaro Herrera
Thomas Munro wrote: > On Tue, Jun 2, 2015 at 9:30 AM, Alvaro Herrera > wrote: > > My guess is that the file existed, and perhaps had one or more pages, > > but the wanted page doesn't exist, so we tried to read but got 0 bytes > > back. read() returns 0 in this case but doesn't set errno. > > >

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-02 Thread Thomas Munro
On Tue, Jun 2, 2015 at 9:30 AM, Alvaro Herrera wrote: > My guess is that the file existed, and perhaps had one or more pages, > but the wanted page doesn't exist, so we tried to read but got 0 bytes > back. read() returns 0 in this case but doesn't set errno. > > I didn't find a way to set things

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-02 Thread Andres Freund
> > Hm. If GetOldestMultiXactOnDisk() gets the starting point by scanning > > the disk it'll always get one at a segment boundary, right? I'm not sure > > that's actually ok; because the value at the beginning of the segment > > can very well end up being a 0, as MaybeExtendOffsetSlru() will have >

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-02 Thread Robert Haas
On Tue, Jun 2, 2015 at 4:19 PM, Andres Freund wrote: > I'm not really convinced tying things closer to having done trimming is > easier to understand than tying things to recovery having finished. > > E.g. > if (did_trim) > oldestOffset = GetOldestReferencedOffset(oldest_da

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-02 Thread Andres Freund
On 2015-06-01 14:22:32 -0400, Robert Haas wrote: > commit d33b4eb0167f465edb00bd6c0e1bcaa67dd69fe9 > Author: Robert Haas > Date: Fri May 29 14:35:53 2015 -0400 > > foo Hehe! > diff --git a/src/backend/access/transam/multixact.c > b/src/backend/access/transam/multixact.c > index 9568ff1.

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-02 Thread Andres Freund
On 2015-06-02 11:49:56 -0400, Robert Haas wrote: > On Tue, Jun 2, 2015 at 11:44 AM, Andres Freund wrote: > > On 2015-06-02 11:37:02 -0400, Robert Haas wrote: > >> The exact circumstances under which we're willing to replace a > >> relminmxid with a newly-computed one that differs are not altogethe

Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-02 Thread Noah Misch
On Tue, Jun 02, 2015 at 11:16:22AM -0400, Robert Haas wrote: > On Tue, Jun 2, 2015 at 1:21 AM, Noah Misch wrote: > > On Mon, Jun 01, 2015 at 02:06:05PM -0400, Robert Haas wrote: > > Granted. Would it be better to update both functions at the same time, and > > perhaps to make that a master-only

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-02 Thread Robert Haas
On Tue, Jun 2, 2015 at 11:44 AM, Andres Freund wrote: > On 2015-06-02 11:37:02 -0400, Robert Haas wrote: >> The exact circumstances under which we're willing to replace a >> relminmxid with a newly-computed one that differs are not altogether >> clear to me, but there's an "if" statement protectin

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-02 Thread Robert Haas
On Tue, Jun 2, 2015 at 11:36 AM, Andres Freund wrote: >> That would be a departure from the behavior of every existing release >> that includes this code based on, to my knowledge, zero trouble >> reports. > > On the other hand we're now at about bug #5 attributeable to the odd way > truncation wo

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-02 Thread Andres Freund
On 2015-06-02 11:37:02 -0400, Robert Haas wrote: > The exact circumstances under which we're willing to replace a > relminmxid with a newly-computed one that differs are not altogether > clear to me, but there's an "if" statement protecting that logic, so > there are some circumstances in which we'

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-02 Thread Robert Haas
On Tue, Jun 2, 2015 at 11:27 AM, Andres Freund wrote: > On 2015-06-02 11:16:22 -0400, Robert Haas wrote: >> I'm having trouble figuring out what to do about this. I mean, the >> essential principle of this patch is that if we can't count on >> relminmxid, datminmxid, or the control file to be acc

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-02 Thread Andres Freund
On 2015-06-02 11:29:24 -0400, Robert Haas wrote: > On Tue, Jun 2, 2015 at 8:56 AM, Andres Freund wrote: > > But what *definitely* looks wrong to me is that a TruncateMultiXact() in > > this scenario now (since a couple weeks ago) does a > > SimpleLruReadPage_ReadOnly() in the members slru via > >

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-02 Thread Robert Haas
On Tue, Jun 2, 2015 at 8:56 AM, Andres Freund wrote: > But what *definitely* looks wrong to me is that a TruncateMultiXact() in > this scenario now (since a couple weeks ago) does a > SimpleLruReadPage_ReadOnly() in the members slru via > find_multixact_start(). That just won't work acceptably whe

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-02 Thread Andres Freund
On 2015-06-02 11:16:22 -0400, Robert Haas wrote: > I'm having trouble figuring out what to do about this. I mean, the > essential principle of this patch is that if we can't count on > relminmxid, datminmxid, or the control file to be accurate, we can at > least look at what is present on the disk

Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-02 Thread Robert Haas
On Tue, Jun 2, 2015 at 1:21 AM, Noah Misch wrote: > On Mon, Jun 01, 2015 at 02:06:05PM -0400, Robert Haas wrote: >> On Mon, Jun 1, 2015 at 12:46 AM, Noah Misch wrote: >> > On Fri, May 29, 2015 at 03:08:11PM -0400, Robert Haas wrote: >> >> SetMultiXactIdLimit() bracketed certain parts of its >> >>

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-02 Thread Andres Freund
On 2015-06-01 14:22:32 -0400, Robert Haas wrote: > On Mon, Jun 1, 2015 at 4:58 AM, Andres Freund wrote: > > The lack of WAL logging actually has caused problems in the 9.3.3 (?) > > era, where we didn't do any truncation during recovery... > > Right, but now we're piggybacking on the checkpoint r

Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-01 Thread Noah Misch
On Mon, Jun 01, 2015 at 02:06:05PM -0400, Robert Haas wrote: > On Mon, Jun 1, 2015 at 12:46 AM, Noah Misch wrote: > > On Fri, May 29, 2015 at 03:08:11PM -0400, Robert Haas wrote: > >> SetMultiXactIdLimit() bracketed certain parts of its > >> logic with if (!InRecovery), but those guards were ineff

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-01 Thread Alvaro Herrera
Alvaro Herrera wrote: > Anyway here's a quick script to almost-reproduce the problem. Meh. Really attached now. I also wanted to post the error messages we got: 2015-05-27 16:15:17 UTC [4782]: [3-1] user=,db= LOG: entering standby mode 2015-05-27 16:15:18 UTC [4782]: [4-1] user=,db= LOG: resto

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-01 Thread Alvaro Herrera
Alvaro Herrera wrote: > Robert Haas wrote: > > In the process of investigating this, we found a few other things that > > seem like they may also be bugs: > > > > - As noted upthread, replaying an older checkpoint after a newer > > checkpoint has already happened may lead to similar problems. Th

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-01 Thread Alvaro Herrera
Thomas Munro wrote: > > - There's a third possible problem related to boundary cases in > > SlruScanDirCbRemoveMembers, but I don't understand that one well > > enough to explain it. Maybe Thomas can jump in here and explain the > > concern. > > I noticed something in passing which is probably n

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-01 Thread Robert Haas
On Mon, Jun 1, 2015 at 4:58 AM, Andres Freund wrote: >> I'm probably biased here, but I think we should finish reviewing, >> testing, and committing my patch before we embark on designing this. > > Probably, yes. I am wondering whether doing this immediately won't end > up making some things simpl

Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-01 Thread Robert Haas
On Mon, Jun 1, 2015 at 12:46 AM, Noah Misch wrote: > Incomplete review, done in a relative rush: Thanks. > On Fri, May 29, 2015 at 03:08:11PM -0400, Robert Haas wrote: >> OK, here's a patch. Actually two patches, differing only in >> whitespace, for 9.3 and for master (ha!). I now think that t

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-01 Thread Andres Freund
On 2015-05-31 07:51:59 -0400, Robert Haas wrote: > > 1) We continue determining the oldest > > SlruScanDirectory(SlruScanDirCbFindEarliest) > >on the master to find the oldest offsets segment to > >truncate. Alternatively, if we determine it to be safe, we could use > >oldestMulti to f

Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-05-31 Thread Noah Misch
On Fri, May 29, 2015 at 10:37:57AM +1200, Thomas Munro wrote: > On Fri, May 29, 2015 at 7:56 AM, Robert Haas wrote: > > - There's a third possible problem related to boundary cases in > > SlruScanDirCbRemoveMembers, but I don't understand that one well > > enough to explain it. Maybe Thomas can j

Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-05-31 Thread Noah Misch
Incomplete review, done in a relative rush: On Fri, May 29, 2015 at 03:08:11PM -0400, Robert Haas wrote: > OK, here's a patch. Actually two patches, differing only in > whitespace, for 9.3 and for master (ha!). I now think that the root > of the problem here is that DetermineSafeOldestOffset() a

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-05-31 Thread Robert Haas
On Sat, May 30, 2015 at 8:55 PM, Andres Freund wrote: > Is oldestMulti, nextMulti - 1 really suitable for this? Are both > actually guaranteed to exist in the offsets slru and be valid? Hm. I > guess you intend to simply truncate everything else, but just in > offsets? oldestMulti in theory is t

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-05-30 Thread Andres Freund
On 2015-05-30 00:52:37 -0300, Alvaro Herrera wrote: > Andres Freund wrote: > > > I considered for a second whether the solution for that could be to not > > truncate while inconsistent - but I think that doesn't solve anything as > > then we can end up with directories where every single offsets/me

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-05-29 Thread Alvaro Herrera
Bruce Momjian wrote: > I think we need to step back and look at the brain power required to > unravel the mess we have made regarding multi-xact and fixes. (I bet > few people can even remember which multi-xact fixes went into which > releases --- I can't.) Instead of working on actual features,

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-05-29 Thread Alvaro Herrera
Andres Freund wrote: > I considered for a second whether the solution for that could be to not > truncate while inconsistent - but I think that doesn't solve anything as > then we can end up with directories where every single offsets/member > file exists. Hang on a minute. We don't need to scan

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-05-29 Thread Thomas Munro
On Sat, May 30, 2015 at 1:46 PM, Andres Freund wrote: > On 2015-05-29 15:08:11 -0400, Robert Haas wrote: >> It seems pretty clear that we can't effectively determine anything >> about member wraparound until the cluster is consistent. > > I wonder if this doesn't actually hints at a bigger problem

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-05-29 Thread Robert Haas
On Fri, May 29, 2015 at 9:46 PM, Andres Freund wrote: > On 2015-05-29 15:08:11 -0400, Robert Haas wrote: >> It seems pretty clear that we can't effectively determine anything >> about member wraparound until the cluster is consistent. > > I wonder if this doesn't actually hints at a bigger problem

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-05-29 Thread Robert Haas
On Fri, May 29, 2015 at 3:08 PM, Robert Haas wrote: > It won't fix the fact that pg_upgrade is putting > a wrong value into everybody's datminmxid field, which should really > be addressed too, but I've been working on this for about three days > virtually non-stop and I don't have the energy to t

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-05-29 Thread Andres Freund
On 2015-05-29 15:08:11 -0400, Robert Haas wrote: > It seems pretty clear that we can't effectively determine anything > about member wraparound until the cluster is consistent. I wonder if this doesn't actually hints at a bigger problem. Currently, to determine where we need to truncate SlruScanD

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-05-29 Thread Andres Freund
On 2015-05-29 15:49:53 -0400, Bruce Momjian wrote: > I think we need to step back and look at the brain power required to > unravel the mess we have made regarding multi-xact and fixes. (I bet > few people can even remember which multi-xact fixes went into which > releases --- I can't.) Instead o

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-05-29 Thread Steve Kehlet
On Fri, May 29, 2015 at 12:08 PM Robert Haas wrote: > OK, here's a patch. > I grabbed branch REL9_4_STABLE from git, and Robert got me a 9.4-specific patch. I rebuilt, installed, and postgres started up successfully! I did a bunch of checks, had our app run several thousand SQL queries against

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-05-29 Thread Bruce Momjian
On Thu, May 28, 2015 at 07:24:26PM -0400, Robert Haas wrote: > On Thu, May 28, 2015 at 4:06 PM, Joshua D. Drake > wrote: > > FTR: Robert, you have been a Samurai on this issue. Our many thanks. > > Thanks! I really appreciate the kind words. > > So, in thinking through this situation further,

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-05-29 Thread Robert Haas
On Fri, May 29, 2015 at 12:43 PM, Robert Haas wrote: > Working on that now. OK, here's a patch. Actually two patches, differing only in whitespace, for 9.3 and for master (ha!). I now think that the root of the problem here is that DetermineSafeOldestOffset() and SetMultiXactIdLimit() were larg

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-05-29 Thread Robert Haas
On Fri, May 29, 2015 at 10:17 AM, Tom Lane wrote: > Thomas Munro writes: >> On Fri, May 29, 2015 at 11:24 AM, Robert Haas wrote: >>> B. We need to change find_multixact_start() to fail softly. > >> Here is an experimental WIP patch that changes StartupMultiXact and >> SetMultiXactIdLimit to find

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-05-29 Thread Tom Lane
Thomas Munro writes: > On Fri, May 29, 2015 at 11:24 AM, Robert Haas wrote: >> B. We need to change find_multixact_start() to fail softly. > Here is an experimental WIP patch that changes StartupMultiXact and > SetMultiXactIdLimit to find the oldest multixact that exists on disk > (by scanning t

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-05-29 Thread Thomas Munro
On Fri, May 29, 2015 at 11:24 AM, Robert Haas wrote: > A. Most obviously, we should fix pg_upgrade so that it installs > chkpnt_oldstMulti instead of chkpnt_nxtmulti into datfrozenxid, so > that we stop creating new instances of this problem. That won't get > us out of the hole we've dug for ours

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-05-28 Thread Robert Haas
On Thu, May 28, 2015 at 10:41 PM, Alvaro Herrera wrote: >> 2. If you pg_upgrade to 9.3.7 or 9.4.2, then you may have datminmxid >> values which are equal to the next-mxid counter instead of the correct >> value; in other words, they are too new. > > [ discussion of how the control file's oldestMul

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-05-28 Thread Alvaro Herrera
Alvaro Herrera wrote: > Robert Haas wrote: > > > 2. If you pg_upgrade to 9.3.7 or 9.4.2, then you may have datminmxid > > values which are equal to the next-mxid counter instead of the correct > > value; in other words, they are too new. > > What you describe is what happens if you upgrade from 9

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-05-28 Thread Alvaro Herrera
Robert Haas wrote: > 2. If you pg_upgrade to 9.3.7 or 9.4.2, then you may have datminmxid > values which are equal to the next-mxid counter instead of the correct > value; in other words, they are too new. What you describe is what happens if you upgrade from 9.2 or earlier. For this case we use

  1   2   >