On Tue, Apr 3, 2012 at 8:28 AM, Kevin Grittner
wrote:
> Might as well jump in with both feet:
>
> autovacuum_naptime = 1s
> autovacuum_vacuum_threshold = 1
> autovacuum_vacuum_scale_factor = 0.0
>
> If that smooths the latency peaks and doesn't hurt performance too
> much, it's decent evidence tha
> Robert Haas wrote:
> Kevin Grittner wrote:
>
>> I can't help thinking that the "background hinter" I had ideas
>> about writing would prevent many of the reads of old CLOG pages,
>> taking a lot of pressure off of this area. It just occurred to me
>> that the difference between that idea and
On Mon, Apr 2, 2012 at 12:58 PM, Kevin Grittner
wrote:
> I can't help thinking that the "background hinter" I had ideas about
> writing would prevent many of the reads of old CLOG pages, taking a
> lot of pressure off of this area. It just occurred to me that the
> difference between that idea an
On Apr 2, 2012, at 3:16 PM, Simon Riggs wrote:
> Agreed, though I think it means the fsync is happening on a filesystem
> that causes a full system fsync. That time is not normal.
It's ext4, which AFAIK does not have that problem.
>
...Robert
On Apr 2, 2012, at 3:04 PM, Tom Lane wrote:
> Seems like basically what you've proven is that this code path *is* a
> performance issue, and that we need to think a bit harder about how to
> avoid doing the fsync while holding locks.
Hmm, good idea. I wonder if we couldn't just hand off the fsync
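A standalone sketch of that hand-off idea (not code from the thread; request_fsync, bg_syncer and the fixed-size queue are invented for illustration): the backend pushes the segment's file descriptor onto a queue and returns immediately, and a background thread absorbs the requests and pays for the fsync with no lock held.

/*
 * Illustration only (not PostgreSQL code): defer the fsync to a
 * background thread instead of doing it while holding a lock.
 */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

#define QUEUE_LEN 64

static int             fd_queue[QUEUE_LEN];   /* fds awaiting fsync */
static int             q_head, q_tail;
static pthread_mutex_t q_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  q_cond = PTHREAD_COND_INITIALIZER;

/* Backend side: O(1), no I/O, cheap enough to do while holding a lock. */
static void
request_fsync(int fd)
{
    pthread_mutex_lock(&q_lock);
    fd_queue[q_tail] = fd;
    q_tail = (q_tail + 1) % QUEUE_LEN;  /* assume the queue never fills */
    pthread_cond_signal(&q_cond);
    pthread_mutex_unlock(&q_lock);
}

/* Background side: absorb requests and pay the fsync cost here. */
static void *
bg_syncer(void *arg)
{
    (void) arg;
    for (;;)
    {
        int fd;

        pthread_mutex_lock(&q_lock);
        while (q_head == q_tail)
            pthread_cond_wait(&q_cond, &q_lock);
        fd = fd_queue[q_head];
        q_head = (q_head + 1) % QUEUE_LEN;
        pthread_mutex_unlock(&q_lock);

        if (fsync(fd) != 0)             /* the multi-second stall lands */
            perror("fsync");            /* here, with no lock held      */
    }
    return NULL;
}

int
main(void)
{
    pthread_t tid;
    int       fd = open("segment.demo", O_CREAT | O_WRONLY, 0600);

    if (fd < 0)
        return 1;
    pthread_create(&tid, NULL, bg_syncer, NULL);
    request_fsync(fd);                  /* returns immediately */
    sleep(1);                           /* give the syncer a moment */
    close(fd);
    return 0;
}

A real implementation would also need overflow handling, e.g. falling back to a synchronous fsync when the queue is full; the sketch simply assumes the queue never fills.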
On Mon, Apr 2, 2012 at 12:04 PM, Tom Lane wrote:
> Robert Haas writes:
>> Long story short, when a CLOG-related stall happens,
>> essentially all the time is being spent in this here section of code:
>
>> /*
>> * If not part of Flush, need to fsync now. We assume this happens
>> * infrequently enough that it's not a performance issue.
On Mon, Apr 2, 2012 at 8:16 PM, Simon Riggs wrote:
> Agreed, though I think it means the fsync is happening on a filesystem
> that causes a full system fsync. That time is not normal.
I don't know what you mean. It looks like there are two cases where
this code path executes. Either more than 16
Simon Riggs writes:
> I suggest we optimise that by moving the dirty block into shared
> buffers and leaving it as dirty. That way we don't need to write or
> fsync at all and the bgwriter can pick up the pieces. So my earlier
> patch to get the bgwriter to clean the clog would be superfluous.
[
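A conceptual sketch of that suggestion, with invented names rather than anything from a patch: the foreground path only copies the dirty CLOG page into a shared buffer and sets a dirty flag, and a later background-writer sweep does the write and fsync.

/*
 * Conceptual sketch only (invented names, not PostgreSQL code): the
 * foreground path copies the dirty CLOG page into a shared buffer and
 * marks it dirty; a background-writer sweep later does the write and
 * fsync.
 */
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define CLOG_PAGE_SIZE 8192
#define N_SHARED_BUFS  4

typedef struct
{
    int  pageno;                        /* which CLOG page is held */
    bool dirty;                         /* still needs write + fsync */
    char data[CLOG_PAGE_SIZE];
} SharedClogBuffer;

static SharedClogBuffer shared_bufs[N_SHARED_BUFS];

/* Foreground: O(memcpy), no I/O, no fsync while locks are held. */
static void
hand_off_dirty_page(int slot, int pageno, const char *page)
{
    memcpy(shared_bufs[slot].data, page, CLOG_PAGE_SIZE);
    shared_bufs[slot].pageno = pageno;
    shared_bufs[slot].dirty = true;     /* bgwriter picks up the pieces */
}

/* Background writer: one sweep over the buffers, flushing dirty ones. */
static void
bgwriter_sweep(void (*write_and_fsync)(int pageno, const char *page))
{
    for (int i = 0; i < N_SHARED_BUFS; i++)
    {
        if (shared_bufs[i].dirty)
        {
            write_and_fsync(shared_bufs[i].pageno, shared_bufs[i].data);
            shared_bufs[i].dirty = false;
        }
    }
}

/* Stub writer so the sketch runs end to end. */
static void
fake_write_and_fsync(int pageno, const char *page)
{
    (void) page;
    printf("background flush of CLOG page %d\n", pageno);
}

int
main(void)
{
    char page[CLOG_PAGE_SIZE] = {0};

    hand_off_dirty_page(0, 42, page);     /* foreground, cheap */
    bgwriter_sweep(fake_write_and_fsync); /* background, pays the I/O */
    return 0;
}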
On Mon, Apr 2, 2012 at 8:04 PM, Tom Lane wrote:
> Robert Haas writes:
>> Long story short, when a CLOG-related stall happens,
>> essentially all the time is being spent in this here section of code:
>
>> /*
>> * If not part of Flush, need to fsync now. We assume this happens
>> * infrequently enough that it's not a performance issue.
Robert Haas writes:
> Long story short, when a CLOG-related stall happens,
> essentially all the time is being spent in this here section of code:
> /*
> * If not part of Flush, need to fsync now. We assume this happens
> * infrequently enough that it's not a performance issue.
>
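For a sense of the cost being discussed, here is a tiny standalone harness (not from the thread) that writes one 8kB page and times the fsync, which is roughly the unit of I/O this code path issues when the write is not part of a bulk flush.

/*
 * Illustration only: time a single fsync() of one freshly written
 * 8kB page, comparable to flushing a single CLOG page.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

int
main(void)
{
    char            page[8192];
    struct timespec t0, t1;
    int             fd = open("clog-page.demo", O_CREAT | O_WRONLY, 0600);

    if (fd < 0)
        return 1;
    memset(page, 0, sizeof(page));
    if (write(fd, page, sizeof(page)) != (ssize_t) sizeof(page))
        return 1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    fsync(fd);                          /* the step seen taking >13s */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    printf("fsync took %.3f ms\n",
           (t1.tv_sec - t0.tv_sec) * 1000.0 +
           (t1.tv_nsec - t0.tv_nsec) / 1e6);
    close(fd);
    return 0;
}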
Robert Haas wrote:
> This particular example shows the above chunk of code taking >13s
> to execute. Within 3s, every other backend piles up behind that,
> leading to the database getting no work at all done for a good ten
> seconds.
>
> My guess is that what's happening here is that one backe
On Mon, Apr 2, 2012 at 7:01 AM, Simon Riggs wrote:
> Do you consider this proof that it can only be I/O? Or do we still
> need to find out?
I stuck a bunch more debugging instrumentation into the SLRU code. It
was fairly clear from the previous round of instrumentation that the
problem was that
On Mon, Apr 2, 2012 at 11:49 AM, Greg Stark wrote:
> On Mon, Apr 2, 2012 at 8:15 AM, Simon Riggs wrote:
>> Not true, please refer to code at line 544, as I already indicated.
>>
>> My understanding of the instrumentation is that the lock acquired at
>> line 526 will show as the blocker until we r
On Mon, Apr 2, 2012 at 8:15 AM, Simon Riggs wrote:
> Not true, please refer to code at line 544, as I already indicated.
>
> My understanding of the instrumentation is that the lock acquired at
> line 526 will show as the blocker until we reach line 555, so anything
> in between could be responsib
On Mon, Apr 2, 2012 at 12:00 AM, Greg Stark wrote:
> On Sun, Apr 1, 2012 at 4:05 AM, Robert Haas wrote:
>> My guess based on previous testing is
>> that what's happening here is (1) we examine a tuple on an old page
>> and decide we must look up its XID, (2) the relevant CLOG page isn't
>> in cac
On Sun, Apr 1, 2012 at 11:12 PM, Greg Stark wrote:
> On Sun, Apr 1, 2012 at 10:27 PM, Simon Riggs wrote:
>> So lock starvation on the control lock would cause a long wait after
>> each I/O, making it look like an I/O problem.
>
> Except that both of the locks involved in his smoking gun occur
> *after* the control lock has already been acquired.
Robert Haas wrote:
> I suppose one interesting question is to figure out if there's a way I
> can optimize the disk configuration in this machine, or the Linux I/O
> scheduler, or something, so as to reduce the amount of time it spends
> waiting for the disk.
I'd be curious to know if using the de
On Sun, Apr 1, 2012 at 4:05 AM, Robert Haas wrote:
> My guess based on previous testing is
> that what's happening here is (1) we examine a tuple on an old page
> and decide we must look up its XID, (2) the relevant CLOG page isn't
> in cache so we decide to read it, but (3) the page we decide to
On Sun, Apr 1, 2012 at 10:27 PM, Simon Riggs wrote:
> So lock starvation on the control lock would cause a long wait after
> each I/O, making it look like an I/O problem.
Except that both of the locks involved in his smoking gun occur
*after* the control lock has already been acquired. The one th
On Sun, Apr 1, 2012 at 1:34 PM, Robert Haas wrote:
> On Sun, Apr 1, 2012 at 7:07 AM, Simon Riggs wrote:
>> First, we need to determine that it is the clog where this is happening.
>
> I can confirm that based on the LWLockIds. There were 32 of them
> beginning at lock id 81, and a gdb session co
On Sun, Apr 1, 2012 at 7:07 AM, Simon Riggs wrote:
> First, we need to determine that it is the clog where this is happening.
I can confirm that based on the LWLockIds. There were 32 of them
beginning at lock id 81, and a gdb session confirms that
ClogCtlData->shared->buffer_locks[0..31] point t
On Sun, Apr 1, 2012 at 4:05 AM, Robert Haas wrote:
> If I filter for waits greater than 8s, a somewhat different picture emerges:
>
> 2 waited at indexam.c:521 blocked by bufmgr.c:2475
> 212 waited at slru.c:310 blocked by slru.c:526
>
> In other words, some of the waits for SLRU pages to
On Sat, Mar 31, 2012 at 9:29 PM, Robert Haas wrote:
> I've discovered a bug in my code that was causing it to print at most
> 2 histogram buckets per lwlock, which obviously means that my previous
> results were totally inaccurate. Ah, the joys of benchmarking. I
> found the problem when I added
On Sat, Mar 31, 2012 at 6:01 PM, Greg Stark wrote:
> On Sat, Mar 31, 2012 at 10:14 PM, Robert Haas wrote:
>>> Isn't that lock held while doing visibility checks?
>>
>> Nope. heap_update() and friends do a very complicated little dance to
>> avoid that.
> ...
>>> What about I/O
>>> waiting for a clog page to be read?
On Sat, Mar 31, 2012 at 10:14 PM, Robert Haas wrote:
>> Isn't that lock held while doing visibility checks?
>
> Nope. heap_update() and friends do a very complicated little dance to
> avoid that.
...
>> What about I/O
>> waiting for a clog page to be read?
>
> I'm pretty sure that can happen
I'm
On Sat, Mar 31, 2012 at 8:58 AM, Greg Stark wrote:
> On Sat, Mar 31, 2012 at 4:41 AM, Robert Haas wrote:
>> But I didn't think we were ever supposed to hold content
>> locks for that long.
>
> Isn't that lock held while doing visibility checks?
Nope. heap_update() and friends do a very complicated little dance to avoid that.
On Sat, Mar 31, 2012 at 1:58 PM, Greg Stark wrote:
> On Sat, Mar 31, 2012 at 4:41 AM, Robert Haas wrote:
>> But I didn't think we were ever supposed to hold content
>> locks for that long.
>
> Isn't that lock held while doing visibility checks? What about I/O
> waiting for a clog page to be read?
On Sat, Mar 31, 2012 at 4:41 AM, Robert Haas wrote:
> But I didn't think we were ever supposed to hold content
> locks for that long.
Isn't that lock held while doing visibility checks? What about I/O
waiting for a clog page to be read?
--
greg
On Sat, Mar 31, 2012 at 4:53 AM, Simon Riggs wrote:
> The next logical step in measuring lock waits is to track the reason
> for the lock wait, not just the lock wait itself.
I had the same thought. I'm not immediately sure what the best way to
do that is, but I'll see if I can figure something
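One possible shape for that, sketched with invented names (TracedLock, traced_acquire) rather than anything from the actual patch: each lock remembers the file:line at which its current holder acquired it, so a waiter can report where it waited, what it was blocked behind, and how long the wait lasted, much like the "waited at slru.c:310 blocked by slru.c:526" lines above.

#include <pthread.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

typedef struct
{
    pthread_mutex_t mutex;
    const char     *holder_file;        /* where the current holder locked */
    int             holder_line;
} TracedLock;

#define TRACED_LOCK_INIT {PTHREAD_MUTEX_INITIALIZER, NULL, 0}

static void
traced_acquire(TracedLock *lk, const char *file, int line)
{
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    if (pthread_mutex_trylock(&lk->mutex) != 0)
    {
        /* note who we are (probably) blocked behind; racy, but this is
         * diagnostics, not correctness */
        const char *bfile = lk->holder_file;
        int         bline = lk->holder_line;

        pthread_mutex_lock(&lk->mutex);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        fprintf(stderr, "waited %.1f ms at %s:%d blocked by %s:%d\n",
                (t1.tv_sec - t0.tv_sec) * 1000.0 +
                (t1.tv_nsec - t0.tv_nsec) / 1e6,
                file, line, bfile ? bfile : "?", bline);
    }
    lk->holder_file = file;
    lk->holder_line = line;
}

#define TRACED_ACQUIRE(lk) traced_acquire((lk), __FILE__, __LINE__)

static TracedLock demo_lock = TRACED_LOCK_INIT;

static void *
worker(void *arg)
{
    (void) arg;
    TRACED_ACQUIRE(&demo_lock);
    usleep(200 * 1000);                 /* pretend to do a slow fsync */
    pthread_mutex_unlock(&demo_lock.mutex);
    return NULL;
}

int
main(void)
{
    pthread_t a, b;

    pthread_create(&a, NULL, worker, NULL);
    pthread_create(&b, NULL, worker, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return 0;
}

The blocker fields are read without the lock held, so a report can occasionally be stale; that seems acceptable for diagnostics.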
On Sat, Mar 31, 2012 at 4:41 AM, Robert Haas wrote:
> which means, if I'm not
> confused here, that every single lwlock-related stall > 1s happened
> while waiting for a buffer content lock. Moreover, each event
> affected a different buffer. I find this result so surprising that I
> have a har
On Fri, Mar 30, 2012 at 1:55 PM, Robert Haas wrote:
> Actually, what is really bugging me is that I cannot find any way of
> getting a profile that reflects the *time* spent waiting rather than
> merely the *number* of waits. That seems like an obvious thing to
> want, and I cannot find a way to
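One sketch of how that could look, again with invented names rather than the actual instrumentation: bucket each wait's duration into power-of-two latency buckets and accumulate both a count and the total time per bucket, so the resulting profile is weighted by time spent waiting rather than by number of waits.

#include <stdint.h>
#include <stdio.h>

#define NBUCKETS 32

typedef struct
{
    uint64_t wait_count[NBUCKETS];  /* number of waits in this bucket */
    uint64_t wait_usec[NBUCKETS];   /* total microseconds in this bucket */
} WaitHistogram;

/* Record one wait of the given length; bucket = floor(log2(usec)). */
static void
record_wait(WaitHistogram *h, uint64_t usec)
{
    int      bucket = 0;
    uint64_t v = usec;

    while (v > 1 && bucket < NBUCKETS - 1)
    {
        v >>= 1;
        bucket++;
    }
    h->wait_count[bucket]++;
    h->wait_usec[bucket] += usec;
}

int
main(void)
{
    WaitHistogram h = {{0}, {0}};

    record_wait(&h, 150);               /* a short wait */
    record_wait(&h, 13 * 1000 * 1000);  /* a 13-second stall */

    for (int i = 0; i < NBUCKETS; i++)
        if (h.wait_count[i] > 0)
            printf("bucket %2d: %llu waits, %llu us total\n", i,
                   (unsigned long long) h.wait_count[i],
                   (unsigned long long) h.wait_usec[i]);
    return 0;
}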