Re: [HACKERS] measuring lwlock-related latency spikes

2012-04-05 Thread Robert Haas
On Tue, Apr 3, 2012 at 8:28 AM, Kevin Grittner wrote:
> Might as well jump in with both feet:
>
> autovacuum_naptime = 1s
> autovacuum_vacuum_threshold = 1
> autovacuum_vacuum_scale_factor = 0.0
>
> If that smooths the latency peaks and doesn't hurt performance too
> much, it's decent evidence tha…
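
Spelled out as a postgresql.conf fragment (the values are exactly the ones Kevin quotes above; the comments are added interpretation, not his):

```
autovacuum_naptime = 1s               # wake the autovacuum launcher every second
autovacuum_vacuum_threshold = 1       # vacuum after a single dead tuple
autovacuum_vacuum_scale_factor = 0.0  # don't scale the threshold with table size
```

Together these make autovacuum fire nearly continuously, which is useful purely as a diagnostic for whether vacuum pressure relates to the latency spikes.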

Re: [HACKERS] measuring lwlock-related latency spikes

2012-04-03 Thread Kevin Grittner
> Robert Haas wrote:
> Kevin Grittner wrote:
>
>> I can't help thinking that the "background hinter" I had ideas
>> about writing would prevent many of the reads of old CLOG pages,
>> taking a lot of pressure off of this area. It just occurred to me
>> that the difference between that idea and…

Re: [HACKERS] measuring lwlock-related latency spikes

2012-04-02 Thread Robert Haas
On Mon, Apr 2, 2012 at 12:58 PM, Kevin Grittner wrote:
> I can't help thinking that the "background hinter" I had ideas about
> writing would prevent many of the reads of old CLOG pages, taking a
> lot of pressure off of this area.  It just occurred to me that the
> difference between that idea an…

Re: [HACKERS] measuring lwlock-related latency spikes

2012-04-02 Thread Robert Haas
On Apr 2, 2012, at 3:16 PM, Simon Riggs wrote:
> Agreed, though I think it means the fsync is happening on a filesystem
> that causes a full system fsync. That time is not normal.

It's ext4, which AFAIK does not have that problem.

Re: [HACKERS] measuring lwlock-related latency spikes

2012-04-02 Thread Robert Haas
On Apr 2, 2012, at 3:04 PM, Tom Lane wrote:
> Seems like basically what you've proven is that this code path *is* a
> performance issue, and that we need to think a bit harder about how to
> avoid doing the fsync while holding locks.

Hmm, good idea. I wonder if we couldn't just hand off the fsync…
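
The hand-off idea can be sketched in miniature outside the server. The sketch below is illustrative Python only (none of these names exist in PostgreSQL): the writing path queues the dirty file to a background "fsync absorber" thread instead of sleeping in fsync itself, which is the shape of the fix being floated here.

```python
import os
import queue
import tempfile
import threading

# Illustrative sketch only: instead of calling fsync() while still
# holding the page lock, the writer queues the file and a background
# thread absorbs the flush, so other backends don't pile up behind I/O.
fsync_queue = queue.Queue()

def fsync_worker():
    while True:
        path = fsync_queue.get()
        if path is None:           # shutdown sentinel
            fsync_queue.task_done()
            break
        fd = os.open(path, os.O_RDWR)
        try:
            os.fsync(fd)           # the slow part now runs off the locked path
        finally:
            os.close(fd)
        fsync_queue.task_done()

def write_page(path, data):
    # "Critical section": write the dirty page, hand off the fsync,
    # and (in the real server) release the lock immediately.
    with open(path, "wb") as f:
        f.write(data)
    fsync_queue.put(path)

# Demo: write one page, wait for the background fsync, shut down.
demo_path = os.path.join(tempfile.mkdtemp(), "clog_page")
worker = threading.Thread(target=fsync_worker, daemon=True)
worker.start()
write_page(demo_path, b"\x01" * 8192)
fsync_queue.join()                 # all queued fsyncs absorbed
fsync_queue.put(None)
worker.join()
flushed = open(demo_path, "rb").read()
```

The real design question, untouched by this sketch, is who guarantees durability if the absorber crashes before flushing.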

Re: [HACKERS] measuring lwlock-related latency spikes

2012-04-02 Thread Jeff Janes
On Mon, Apr 2, 2012 at 12:04 PM, Tom Lane wrote:
> Robert Haas writes:
>> Long story short, when a CLOG-related stall happens,
>> essentially all the time is being spent in this here section of code:
>
>>     /*
>>      * If not part of Flush, need to fsync now.  We assume this happens
>>      *…

Re: [HACKERS] measuring lwlock-related latency spikes

2012-04-02 Thread Greg Stark
On Mon, Apr 2, 2012 at 8:16 PM, Simon Riggs wrote:
> Agreed, though I think it means the fsync is happening on a filesystem
> that causes a full system fsync. That time is not normal.

I don't know what you mean. It looks like there are two cases where
this code path executes. Either more than 16…

Re: [HACKERS] measuring lwlock-related latency spikes

2012-04-02 Thread Tom Lane
Simon Riggs writes:
> I suggest we optimise that by moving the dirty block into shared
> buffers and leaving it as dirty. That way we don't need to write or
> fsync at all and the bgwriter can pick up the pieces. So my earlier
> patch to get the bgwriter to clean the clog would be superfluous.

[…

Re: [HACKERS] measuring lwlock-related latency spikes

2012-04-02 Thread Simon Riggs
On Mon, Apr 2, 2012 at 8:04 PM, Tom Lane wrote:
> Robert Haas writes:
>> Long story short, when a CLOG-related stall happens,
>> essentially all the time is being spent in this here section of code:
>
>>     /*
>>      * If not part of Flush, need to fsync now.  We assume this happens
>>      * i…

Re: [HACKERS] measuring lwlock-related latency spikes

2012-04-02 Thread Tom Lane
Robert Haas writes:
> Long story short, when a CLOG-related stall happens,
> essentially all the time is being spent in this here section of code:
>
> /*
>  * If not part of Flush, need to fsync now.  We assume this happens
>  * infrequently enough that it's not a performance issue.
>…
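
The failure mode being discussed, one slow fsync under a held lock serializing everyone behind it, is easy to reproduce in miniature. This is illustrative Python, not server code; the 0.2-second sleep stands in for the multi-second fsync reported later in the thread:

```python
import threading
import time

# One thread sleeps while holding a lock (standing in for a backend
# stuck in fsync), and every waiter's latency inherits the full delay.
lock = threading.Lock()
waits = []

def slow_holder():
    with lock:
        time.sleep(0.2)    # stand-in for the long fsync

def waiter():
    start = time.monotonic()
    with lock:             # blocks until the holder's "fsync" finishes
        pass
    waits.append(time.monotonic() - start)

holder = threading.Thread(target=slow_holder)
holder.start()
time.sleep(0.05)           # let the holder take the lock first
waiters = [threading.Thread(target=waiter) for _ in range(4)]
for t in waiters:
    t.start()
for t in waiters + [holder]:
    t.join()
```

Every waiter's measured wait is roughly the holder's remaining sleep, which is exactly the "everyone piles up" shape described below.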

Re: [HACKERS] measuring lwlock-related latency spikes

2012-04-02 Thread Kevin Grittner
Robert Haas wrote:
> This particular example shows the above chunk of code taking >13s
> to execute. Within 3s, every other backend piles up behind that,
> leading to the database getting no work at all done for a good ten
> seconds.

My guess is that what's happening here is that one backe…

Re: [HACKERS] measuring lwlock-related latency spikes

2012-04-02 Thread Robert Haas
On Mon, Apr 2, 2012 at 7:01 AM, Simon Riggs wrote:
> Do you consider this proof that it can only be I/O? Or do we still
> need to find out?

I stuck a bunch more debugging instrumentation into the SLRU code. It
was fairly clear from the previous round of instrumentation that the
problem was that…

Re: [HACKERS] measuring lwlock-related latency spikes

2012-04-02 Thread Simon Riggs
On Mon, Apr 2, 2012 at 11:49 AM, Greg Stark wrote:
> On Mon, Apr 2, 2012 at 8:15 AM, Simon Riggs wrote:
>> Not true, please refer to code at line 544, as I already indicated.
>>
>> My understanding of the instrumentation is that the lock acquired at
>> line 526 will show as the blocker until we r…

Re: [HACKERS] measuring lwlock-related latency spikes

2012-04-02 Thread Greg Stark
On Mon, Apr 2, 2012 at 8:15 AM, Simon Riggs wrote:
> Not true, please refer to code at line 544, as I already indicated.
>
> My understanding of the instrumentation is that the lock acquired at
> line 526 will show as the blocker until we reach line 555, so anything
> in between could be responsib…

Re: [HACKERS] measuring lwlock-related latency spikes

2012-04-02 Thread Simon Riggs
On Mon, Apr 2, 2012 at 12:00 AM, Greg Stark wrote:
> On Sun, Apr 1, 2012 at 4:05 AM, Robert Haas wrote:
>> My guess based on previous testing is
>> that what's happening here is (1) we examine a tuple on an old page
>> and decide we must look up its XID, (2) the relevant CLOG page isn't
>> in cac…

Re: [HACKERS] measuring lwlock-related latency spikes

2012-04-02 Thread Simon Riggs
On Sun, Apr 1, 2012 at 11:12 PM, Greg Stark wrote:
> On Sun, Apr 1, 2012 at 10:27 PM, Simon Riggs wrote:
>> So lock starvation on the control lock would cause a long wait after
>> each I/O, making it look like an I/O problem.
>
> Except that both of the locks involved in his smoking gun occur
> *…

Re: [HACKERS] measuring lwlock-related latency spikes

2012-04-02 Thread Albe Laurenz
Robert Haas wrote:
> I suppose one interesting question is to figure out if there's a way I
> can optimize the disk configuration in this machine, or the Linux I/O
> scheduler, or something, so as to reduce the amount of time it spends
> waiting for the disk.

I'd be curious to know if using the de…
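
For context on the I/O-scheduler knob mentioned above: on Linux the active scheduler for a device appears as the bracketed token in `/sys/block/<dev>/queue/scheduler` (e.g. `noop deadline [cfq]`), and writing another scheduler's name into that file switches it. A tiny parser for that sysfs line, as a hypothetical helper (not something from this thread):

```python
# Hypothetical helper: return the scheduler marked with [brackets] in a
# /sys/block/<dev>/queue/scheduler line such as "noop deadline [cfq]".
def active_scheduler(sysfs_line):
    for token in sysfs_line.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    raise ValueError("no active scheduler marked in: " + sysfs_line)
```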

Re: [HACKERS] measuring lwlock-related latency spikes

2012-04-01 Thread Greg Stark
On Sun, Apr 1, 2012 at 4:05 AM, Robert Haas wrote:
> My guess based on previous testing is
> that what's happening here is (1) we examine a tuple on an old page
> and decide we must look up its XID, (2) the relevant CLOG page isn't
> in cache so we decide to read it, but (3) the page we decide to…

Re: [HACKERS] measuring lwlock-related latency spikes

2012-04-01 Thread Greg Stark
On Sun, Apr 1, 2012 at 10:27 PM, Simon Riggs wrote:
> So lock starvation on the control lock would cause a long wait after
> each I/O, making it look like an I/O problem.

Except that both of the locks involved in his smoking gun occur
*after* the control lock has already been acquired. The one th…

Re: [HACKERS] measuring lwlock-related latency spikes

2012-04-01 Thread Simon Riggs
On Sun, Apr 1, 2012 at 1:34 PM, Robert Haas wrote:
> On Sun, Apr 1, 2012 at 7:07 AM, Simon Riggs wrote:
>> First, we need to determine that it is the clog where this is happening.
>
> I can confirm that based on the LWLockIds.  There were 32 of them
> beginning at lock id 81, and a gdb session co…

Re: [HACKERS] measuring lwlock-related latency spikes

2012-04-01 Thread Robert Haas
On Sun, Apr 1, 2012 at 7:07 AM, Simon Riggs wrote:
> First, we need to determine that it is the clog where this is happening.

I can confirm that based on the LWLockIds. There were 32 of them
beginning at lock id 81, and a gdb session confirms that
ClogCtlData->shared->buffer_locks[0..31] point t…

Re: [HACKERS] measuring lwlock-related latency spikes

2012-04-01 Thread Simon Riggs
On Sun, Apr 1, 2012 at 4:05 AM, Robert Haas wrote:
> If I filter for waits greater than 8s, a somewhat different picture emerges:
>
>      2 waited at indexam.c:521 blocked by bufmgr.c:2475
>    212 waited at slru.c:310 blocked by slru.c:526
>
> In other words, some of the waits for SLRU pages to…

Re: [HACKERS] measuring lwlock-related latency spikes

2012-03-31 Thread Robert Haas
On Sat, Mar 31, 2012 at 9:29 PM, Robert Haas wrote:
> I've discovered a bug in my code that was causing it to print at most
> 2 histogram buckets per lwlock, which obviously means that my previous
> results were totally inaccurate.  Ah, the joys of benchmarking.  I
> found the problem when I added…
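
For readers following along, the kind of per-lwlock wait histogram under discussion can be sketched with power-of-two buckets, which keep a rare multi-second stall visible alongside millions of fast acquisitions. This is an assumed shape for illustration; Robert's actual patch is not shown in the thread:

```python
# Power-of-two wait histogram: bucket i counts waits whose duration in
# microseconds falls in [2**(i-1), 2**i); bucket 0 holds zero-length waits.
NUM_BUCKETS = 32

def bucket_for(wait_us):
    return min(wait_us.bit_length(), NUM_BUCKETS - 1)

hist = [0] * NUM_BUCKETS
for w in (3, 120, 950, 13_000_000):   # three fast waits plus a 13 s stall
    hist[bucket_for(w)] += 1

# Emit every non-empty bucket; the bug described above was printing at
# most two buckets per lwlock, which hid the shape of the distribution.
nonempty = [(i, n) for i, n in enumerate(hist) if n]
```

With only two buckets printed, the long tail (the 13-second entry) and the fast majority cannot both appear, which is exactly why the earlier results were misleading.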

Re: [HACKERS] measuring lwlock-related latency spikes

2012-03-31 Thread Robert Haas
On Sat, Mar 31, 2012 at 6:01 PM, Greg Stark wrote:
> On Sat, Mar 31, 2012 at 10:14 PM, Robert Haas wrote:
>>> Isn't that lock held while doing visibility checks?
>>
>> Nope.  heap_update() and friends do a very complicated little dance to
>> avoid that.
> ...
>>> What about I/O
>>> waiting for a…

Re: [HACKERS] measuring lwlock-related latency spikes

2012-03-31 Thread Greg Stark
On Sat, Mar 31, 2012 at 10:14 PM, Robert Haas wrote:
>> Isn't that lock held while doing visibility checks?
>
> Nope.  heap_update() and friends do a very complicated little dance to
> avoid that.
...
>> What about I/O
>> waiting for a clog page to be read?
>
> I'm pretty sure that can happen

I'm…

Re: [HACKERS] measuring lwlock-related latency spikes

2012-03-31 Thread Robert Haas
On Sat, Mar 31, 2012 at 8:58 AM, Greg Stark wrote:
> On Sat, Mar 31, 2012 at 4:41 AM, Robert Haas wrote:
>> But I didn't think we were ever supposed to hold content
>> locks for that long.
>
> Isn't that lock held while doing visibility checks?

Nope.  heap_update() and friends do a very complica…

Re: [HACKERS] measuring lwlock-related latency spikes

2012-03-31 Thread Simon Riggs
On Sat, Mar 31, 2012 at 1:58 PM, Greg Stark wrote:
> On Sat, Mar 31, 2012 at 4:41 AM, Robert Haas wrote:
>> But I didn't think we were ever supposed to hold content
>> locks for that long.
>
> Isn't that lock held while doing visibility checks? What about I/O
> waiting for a clog page to be read?…

Re: [HACKERS] measuring lwlock-related latency spikes

2012-03-31 Thread Greg Stark
On Sat, Mar 31, 2012 at 4:41 AM, Robert Haas wrote:
> But I didn't think we were ever supposed to hold content
> locks for that long.

Isn't that lock held while doing visibility checks? What about I/O
waiting for a clog page to be read?

Re: [HACKERS] measuring lwlock-related latency spikes

2012-03-31 Thread Robert Haas
On Sat, Mar 31, 2012 at 4:53 AM, Simon Riggs wrote:
> The next logical step in measuring lock waits is to track the reason
> for the lock wait, not just the lock wait itself.

I had the same thought. I'm not immediately sure what the best way to
do that is, but I'll see if I can figure something…
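
One way to "track the reason" is to record, for each wait, the code site that held the lock alongside the site that waited, and then aggregate the pairs; that is essentially the file:line report Robert produces later in the thread. A minimal sketch (illustrative Python, using the site strings and counts from that later report):

```python
from collections import Counter

# Record (waiter_site, blocker_site) pairs so the report says not just
# where backends waited, but which code path was holding the lock.
wait_pairs = Counter()

def record_wait(waiter_site, blocker_site):
    wait_pairs[(waiter_site, blocker_site)] += 1

# Replay the counts from the >8s report: 212 SLRU waits, 2 index waits.
for _ in range(212):
    record_wait("slru.c:310", "slru.c:526")
for _ in range(2):
    record_wait("indexam.c:521", "bufmgr.c:2475")

top_pair, top_count = wait_pairs.most_common(1)[0]
```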

Re: [HACKERS] measuring lwlock-related latency spikes

2012-03-31 Thread Simon Riggs
On Sat, Mar 31, 2012 at 4:41 AM, Robert Haas wrote:
> which means, if I'm not
> confused here, that every single lwlock-related stall > 1s happened
> while waiting for a buffer content lock.  Moreover, each event
> affected a different buffer.  I find this result so surprising that I
> have a har…

[HACKERS] measuring lwlock-related latency spikes

2012-03-30 Thread Robert Haas
On Fri, Mar 30, 2012 at 1:55 PM, Robert Haas wrote:
> Actually, what is really bugging me is that I cannot find any way of
> getting a profile that reflects the *time* spent waiting rather than
> merely the *number* of waits.  That seems like an obvious thing to
> want, and I cannot find a way to…
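
A time-weighted wait profile of the sort being asked for can be sketched as a wrapper that accumulates elapsed wait per call site rather than just incrementing a counter. This is illustrative Python, not the lwlock instrumentation itself; the site string is borrowed from the thread's later reports:

```python
import threading
import time
from collections import defaultdict
from contextlib import contextmanager

wait_time = defaultdict(float)   # seconds spent blocked, per call site
wait_count = defaultdict(int)    # number of waits, per call site

@contextmanager
def timed_acquire(site, lock):
    # Accumulate how long we blocked, not merely that we blocked: a site
    # with few but very long waits then dominates the profile correctly.
    start = time.monotonic()
    lock.acquire()
    wait_time[site] += time.monotonic() - start
    wait_count[site] += 1
    try:
        yield
    finally:
        lock.release()

lock = threading.Lock()
with timed_acquire("slru.c:310", lock):
    pass   # uncontended: counted once, near-zero accumulated time
```

Sorting sites by `wait_time` instead of `wait_count` gives exactly the profile a count-based histogram cannot: rare 10-second stalls outrank millions of microsecond waits.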