Can PostgreSQL create new WAL files instead of reusing old ones?
tl;dr: We've found that under many conditions, PostgreSQL's re-use of old WAL files appears to significantly degrade query latency on ZFS. The reason is complicated and I have details below. Has it been considered to make this behavior tunable, to cause PostgreSQL to always create new WAL files instead of re-using old ones?

Context: we're running a large fleet of PostgreSQL shards. Each shard consists of a primary, a synchronous standby, and an asynchronous standby using chained replication. For this problem, we can consider only the primary and synchronous standby.

PostgreSQL: 9.6.3
OS: illumos (SmartOS, mixed versions, but all from 2017 or later)
FS: ZFS over mirrored HDDs (not SSDs), using a record size of 8K to match PostgreSQL's 8K page size. We have an SSD log device so that synchronous writes complete very quickly.

WAL files are 16MB each, and we're keeping a very large number of segments. (There's likely a lot of room for improvement in our WAL tuning.)

Since we're using an 8K record size, when PostgreSQL writes small (or non-aligned) records to the WAL files, ZFS has to read the old contents in order to write the new 8K record (read-modify-write). If that's in cache, that's not a big deal. But when PostgreSQL decides to reuse an old WAL file whose contents have been evicted from the cache (because they haven't been used in hours), this turns what should be a workload bottlenecked by synchronous write performance (which is well-optimized with our SSD log device) into a random read workload (which is much more expensive on any system).

What's significantly worse is that we saw this on synchronous standbys. When that happens, the WAL receiver is blocked on a random read from disk, and since it's single-threaded, all write queries on the primary stop until the random read finishes. This is particularly bad for us when the synchronous standby is doing other I/O (e.g., for an autovacuum or a database backup) that causes disk reads to take hundreds of milliseconds.

Reusing old WAL files seems like an optimization intended for filesystems that allocate disk blocks up front. With copy-on-write, it doesn't seem to make much sense. If, instead of reusing an old WAL file, PostgreSQL just created a new one, no random reads would be required to complete these operations, and we believe we'd completely eliminate our latency outliers.

Thanks,
Dave
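
To make the mechanics concrete for readers following along, here is a minimal C sketch of the two behaviors under discussion. This is not PostgreSQL's actual xlog.c code; the names and the /tmp path are illustrative only. It contrasts recycling an old segment (rename it into place and overwrite it, which on a copy-on-write filesystem can force reads of stale 8K records) with creating a fresh zero-filled segment (which never requires reading old data back):

    #include <fcntl.h>
    #include <stddef.h>
    #include <stdio.h>
    #include <unistd.h>

    #define WAL_SEGMENT_SIZE (16 * 1024 * 1024)

    /*
     * Recycling (simplified): rename an old, no-longer-needed segment into
     * place and overwrite it.  On a copy-on-write filesystem with an 8K
     * record size, overwriting a record that has fallen out of cache forces
     * the filesystem to read the old 8K record first (read-modify-write) --
     * a random read on the synchronous-write path.
     */
    int
    recycle_segment(const char *oldpath, const char *newpath)
    {
        return rename(oldpath, newpath);
    }

    /*
     * The alternative being asked about (simplified): create a fresh,
     * zero-filled segment.  Nothing old needs to be read back; WAL writes
     * land in newly allocated records.
     */
    int
    create_segment(const char *path)
    {
        char zeros[8192] = { 0 };
        int  fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0600);

        if (fd < 0)
            return -1;
        for (size_t off = 0; off < WAL_SEGMENT_SIZE; off += sizeof(zeros)) {
            if (write(fd, zeros, sizeof(zeros)) != (ssize_t) sizeof(zeros)) {
                close(fd);
                return -1;
            }
        }
        return close(fd);
    }

    int
    main(void)
    {
        /* Illustrative only: allocate one fresh 16MB segment under /tmp. */
        return create_segment("/tmp/demo-wal-segment") == 0 ? 0 : 1;
    }

The tradeoff the thread goes on to discuss is that the second path performs allocation (and so can fail with ENOSPC) at the time the segment is needed, whereas recycling was designed to avoid exactly that.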
Re: Can PostgreSQL create new WAL files instead of reusing old ones?
On Tue, Apr 17, 2018 at 11:57 AM, Tom Lane wrote:
> Alvaro Herrera writes:
> > David Pacheco wrote:
> >> tl;dr: We've found that under many conditions, PostgreSQL's re-use of old WAL files appears to significantly degrade query latency on ZFS. The reason is complicated and I have details below. Has it been considered to make this behavior tunable, to cause PostgreSQL to always create new WAL files instead of re-using old ones?
> >
> > I don't think this has ever been proposed, because there was no use case for it. Maybe you want to work on a patch for it?
>
> I think possibly the OP doesn't understand why it's designed that way. The point is not really to "recycle old WAL files", it's to avoid having disk space allocation occur during the critical section where we must PANIC on failure. Now, of course, that doesn't really work if the filesystem is COW underneath, because it's allocating fresh disk space anyway even though semantically we're overwriting existing data. But what I'd like to see is a fix that deals with that somehow, rather than continue to accept the possibility of ENOSPC occurring inside WAL writes on these file systems. I have no idea what such a fix would look like :-(

I think I do understand, but as you've observed, recycling WAL files to avoid allocation relies on the implementation details of the filesystem -- details that I'd expect not to be true of any copy-on-write filesystem. On such systems, there may not be a way to avoid ENOSPC in these critical sections. (And that's not necessarily such a big deal -- to paraphrase a colleague, ensuring that the system doesn't run out of space does not seem like a particularly surprising or heavy burden for the operator. It's great that PostgreSQL can survive this event better on some systems, but the associated tradeoffs may not be worthwhile for everybody.)

And given that, it seems worthwhile to provide the operator an option where they take on the risk that the database might crash if it runs out of space (assuming the result isn't data corruption) in exchange for a potentially tremendous improvement in tail latency and overall throughput.

To quantify this: in a recent incident, transaction latency on the primary was degraded about 2-3x (from a p90 of about 45ms to upwards of 120ms, with outliers exceeding 1s). Over 95% of the outliers above 1s spent over 90% of their time blocked on synchronous replication (based on tracing with DTrace). On the synchronous standby, almost 10% of the WAL receiver's wall clock time was spent blocked on disk reads in this read-modify-write path. The rest of the time was essentially idle -- there was plenty of headroom in other dimensions (CPU, synchronous write performance).

Thanks,
Dave
Re: Can PostgreSQL create new WAL files instead of reusing old ones?
On Wed, Jun 20, 2018 at 10:35 AM, Jerry Jelinek wrote:
> As Dave described in his original email on this topic, we'd like to avoid recycling WAL files since that can cause performance issues when we have a read-modify-write on a file that has dropped out of the cache.
>
> I have implemented a small change to allow WAL recycling to be disabled. It is visible at: https://cr.joyent.us/#/c/4263/
>
> I'd appreciate getting any feedback on this.
>
> Thanks,
> Jerry

For reference, there's more context in this thread from several months ago: https://www.postgresql.org/message-id/flat/CACukRjO7DJvub8e2AijOayj8BfKK3XXBTwu3KKARiTr67M3E3w%40mail.gmail.com#cacukrjo7djvub8e2aijoayj8bfkk3xxbtwu3kkaritr67m3...@mail.gmail.com

I'll repeat the relevant summary here:

> tl;dr: We've found that under many conditions, PostgreSQL's re-use of old WAL files appears to significantly degrade query latency on ZFS. The reason is complicated and I have details below. Has it been considered to make this behavior tunable, to cause PostgreSQL to always create new WAL files instead of re-using old ones?

Thanks,
Dave
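
For anyone skimming the thread, the shape of such a change is simply a knob that bypasses the recycle path. The following is only an illustration under assumed names -- it is not the code at the review link above, the setting name is made up, and the two helpers are trivial stand-ins for the segment-handling sketch earlier in the thread:

    #include <stdbool.h>
    #include <stdio.h>

    /* Hypothetical knob; the real patch defines its own setting. */
    static bool wal_recycle_segments = true;

    /* Trivial stand-ins for the segment helpers sketched earlier. */
    static int
    recycle_segment(const char *oldpath, const char *newpath)
    {
        return rename(oldpath, newpath);
    }

    static int
    create_segment(const char *path)
    {
        FILE *f = fopen(path, "w");     /* stand-in for zero-filling a segment */

        if (f == NULL)
            return -1;
        return fclose(f);
    }

    static void
    prepare_next_segment(const char *oldpath, const char *newpath)
    {
        /*
         * Recycle only when the knob allows it and an old segment can be
         * renamed into place; otherwise fall back to a brand-new segment so
         * no stale records ever need to be read back in.
         */
        if (wal_recycle_segments && recycle_segment(oldpath, newpath) == 0)
            return;
        (void) create_segment(newpath);
    }

    int
    main(void)
    {
        wal_recycle_segments = false;   /* the proposed new mode */
        prepare_next_segment("/tmp/old-seg", "/tmp/new-seg");
        return 0;
    }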
Autovacuum degrades all other operations by keeping all buffers dirty?
Hello,

We've been struggling with some major performance issues related to autovacuum. (I know this is a common problem.) For a while, we believed this was primarily related to I/O contention, but recent observations from our monitoring make me wonder if there's a deeper issue here, so I'm looking for some better understanding.

From reading the 9.6.3 source, it looks like the autovacuum process itself is single-threaded, and it reads pages essentially linearly from the relation (possibly skipping some). When the autovacuum process needs to modify a page, it doesn't write it directly, but rather marks the buffer dirty. The page will be written later, either by the checkpointer (according to its configuration, in terms of time and WAL), the bgwriter (according to its configuration, in terms of write count and sleep time), or else some other backend process that requires a free buffer (if the count of non-dirty buffers reaches zero). Is this accurate?

In our case, we found that when autovacuum runs, the number of dirty buffers written by regular backends shoots from a handful to as many as 300 buffers per second. (We have 200 backends on most databases.) More specifically: prior to recent autovacuums starting, databases did under 1,000 buffer allocations per second. Most of those were done by the checkpointer. Individual backends did just a handful. After autovacuum started, buffer allocations rose to between 2,000 and 8,000 per second. The checkpointer handled many of these, but so did individual backends (see above). The bgwriter peaked around 25 buffer writes per second.

So it seems like the spike in buffers written by normal backends could explain the significant degradation that we see in average latency and overall throughput (which can be as much as 50%). It looks to me like the autovacuum process is effectively generating work (in the form of async writes) that's being distributed implicitly to the various backend processes, creating latency for any other query that happens to require a buffer (including read-only queries).

Maybe we can improve this by tuning the bgwriter. But if that's single-threaded, presumably there's an upper bound to how many buffer writes it can handle? Is there anything to guarantee that the bgwriter will keep up with the work that's being generated? That is, for some workloads, wouldn't it be possible that autovacuum could always generate work faster than the bgwriter can do it, and you'd always see pretty major degradation of all other queries?

I've drawn a lot of inferences here, and I'm sure there are some mistakes. I'd appreciate any clarification, correction, or confirmation!

Thanks in advance,
Dave
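
To make the mechanism being described concrete, here is a greatly simplified sketch -- not the actual bufmgr.c code, and it omits usage counts, the free list, pinning, and locking -- of why a backend can end up doing writes on behalf of autovacuum: if the victim buffer a backend picks is dirty and neither the checkpointer nor the bgwriter has flushed it yet, the backend must perform the write itself before it can proceed.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdio.h>

    #define NBUFFERS 8

    struct buffer {
        bool dirty;
        int  page;
    };

    static struct buffer buffers[NBUFFERS];

    /* Stand-in for a synchronous page write. */
    static void
    write_page_to_disk(struct buffer *b)
    {
        printf("backend writing out dirty page %d\n", b->page);
    }

    /*
     * A backend that needs a buffer takes the next victim; if it's dirty
     * (e.g., because autovacuum dirtied it and no background process has
     * flushed it yet), the backend pays for the write itself.
     */
    static struct buffer *
    get_free_buffer(void)
    {
        static size_t sweep;                   /* clock hand, simplified */
        struct buffer *b = &buffers[sweep++ % NBUFFERS];

        if (b->dirty) {
            write_page_to_disk(b);
            b->dirty = false;
        }
        return b;
    }

    int
    main(void)
    {
        buffers[0].dirty = true;               /* pretend autovacuum dirtied a page */
        buffers[0].page = 42;
        get_free_buffer();                     /* this "query" pays for the write */
        return 0;
    }

The real server's victim selection is considerably more involved, but the essential point is the same: any backend, including one running a read-only query, can end up paying for a write that something else deferred.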
Re: Autovacuum degrades all other operations by keeping all buffers dirty?
On Fri, Aug 31, 2018 at 3:50 PM, Andres Freund wrote:
> On 2018-08-31 19:31:47 -0300, Alvaro Herrera wrote:
> > On 2018-Aug-31, David Pacheco wrote:
> > > From reading the 9.6.3 source, it looks like the autovacuum process itself is single-threaded, and it reads pages essentially linearly from the relation (possibly skipping some). When the autovacuum process needs to modify a page, it doesn't write it directly, but rather marks the buffer dirty. The page will be written later,
> >
> > Unless there's some bug, there is a BufferAccessStrategy that only lets a few dozen buffers go unwritten before the autovac worker process itself is forced to write some.
>
> I've not re-checked, but I'm not sure that's true if the buffer is already in s_b, which it'll be for many workloads.

Does that mean this analysis from above is accurate?

> It looks to me like the autovacuum process is effectively generating work (in the form of async writes) that's being distributed implicitly to the various backend processes, creating latency for any other query that happens to require a buffer (including read-only queries).

Thanks,
Dave
Re: [GENERAL] postmaster deadlock while logging after syslogger exited
Responding to several points below.

Tom Lane wrote:
> David's report isn't too clear: did the syslogger process actually run out of memory and exit of its own volition after an ENOMEM, or did it get killed by the dreaded OOM killer? In either case, it's unclear whether it was really using an excessive amount of memory. We have not heard reports suggesting a memory leak in the syslogger, but maybe there is one under unusual circumstances?

I'm running on illumos, where there is no OOM killer. I expect what happened is that the syslogger process attempted to allocate memory, failed because the system was low, and explicitly exited. That's consistent with an exited process, no core file generated, and the "FATAL" "out of memory" entries in the log I posted. Of course, that doesn't mean the syslogger was using much memory. It means only that it attempted to allocate memory in a context where it could not handle failure.

Andres Freund wrote:
> I'm somewhat inclined to say that one has to live with this if the system is so resource constrained that processes barely using memory get killed.

Again, there's no OOM killer here, so the process has the opportunity to respond differently. I can understand the database not functioning well when the system is out of memory, but this half-functioning, half-deadlocked state is quite disruptive. By comparison, while not ideal, if the syslogger had raised SIGABRT instead, that would have been significantly better for my use case: our process manager would have terminated all of the processes associated with the database and restarted it.

Tom Lane wrote:
> Hm, so that's another angle David didn't report on: is it possible that his workload could have resulted in a very large volume of incomplete in-progress log messages?

Yes. I mentioned in my Nov 6 mail that large log messages over a short period appear to have been a major contributing factor:

> Here are all of the entries in the PostgreSQL log from 23:19:12 until the top of the next hour: https://gist.githubusercontent.com/davepacheco/c5541bb464532075f2da761dd990a457/raw/2ba242055aca2fb374e9118045a830d08c590e0a/gistfile1.txt
...
> There are also many log entries for some very long SQL queries. I'm sure that contributed to this problem by filling up the pipe. I was able to extract the contents of the pipe while the system was hung, and it was more of these giant query strings.

Andres Freund wrote:
> Note that there's plenty of cases where you could run into this even without being unable to fork new processes. You could e.g. also run into this while logging the exit of some other subprocess or such, there's enough ereports in postmaster.

This is a critical point. As far as I can tell, all that's necessary for this deadlock to occur is:

- the syslogger is unable to make forward progress (e.g., because it segfaulted)
- enough other processes write to the pipe that it fills up before the postmaster can restart the syslogger
- the postmaster attempts to write to the log as part of any of its signal handling operations

It seems to me that the same thing could happen if a bug in the syslogger caused a segfault around the same time that the postmaster attempted to start an autovacuum worker, for example (given reasonably high log traffic).

Thanks,
Dave
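
The blocking behavior at the heart of this is ordinary pipe semantics, which the following small, standalone C program illustrates (nothing here is PostgreSQL-specific): once a pipe's buffer is full and nobody is reading from it, a blocking write hangs indefinitely, which is the position the postmaster is in when the syslogger is gone and the logging pipe has filled.

    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int
    main(void)
    {
        int  fds[2];
        char buf[4096];

        if (pipe(fds) != 0) {
            perror("pipe");
            return 1;
        }
        memset(buf, 'x', sizeof(buf));

        /* Fill the pipe, as many log-writing processes would in aggregate. */
        (void) fcntl(fds[1], F_SETFL, O_NONBLOCK);
        while (write(fds[1], buf, sizeof(buf)) > 0)
            ;
        printf("pipe is full (last write failed with %s)\n", strerror(errno));

        /*
         * Back to blocking mode: with no reader draining the pipe, this
         * write never returns -- the analogue of the postmaster blocking in
         * its logging path after the syslogger has exited.
         */
        (void) fcntl(fds[1], F_SETFL, 0);
        printf("attempting one more (blocking) write; this will hang\n");
        fflush(stdout);
        (void) write(fds[1], buf, sizeof(buf));
        return 0;
    }

Pipe capacity is finite (typically on the order of tens of kilobytes), so a burst of very long query strings like the ones in the gist above can fill it quickly.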
Re: [GENERAL] postmaster deadlock while logging after syslogger exited
Thanks again for looking at this issue.

On Mon, Nov 20, 2017 at 10:34 AM, Andres Freund wrote:
> Hi,
>
> On 2017-11-20 10:13:57 -0800, David Pacheco wrote:
> > I expect what happened is that the syslogger process attempted to allocate memory, failed because the system was low, and explicitly exited. That's consistent with an exited process, no core file generated, and the "FATAL" "out of memory" entries in the log I posted.
>
> Well unfortunately we're presumably not going to see output from syslogger's death itself, chicken and egg etc.

If that's true, I think that's a separate bug. If a process receives ENOMEM from an attempt to allocate memory, it can still write out a log message to the already-open file descriptor indicating that it's exiting because of that failure. For all I know, that's what happened here, as there are FATAL "out of memory" entries in the log. (I'd be able to tell for sure if we had configured logging to include pids in the log messages.)

> > Tom Lane wrote:
> > > Hm, so that's another angle David didn't report on: is it possible that his workload could have resulted in a very large volume of incomplete in-progress log messages?
> >
> > Yes. I mentioned in my Nov 6 mail that large log messages over a short period appear to have been a major contributing factor:
>
> So it's actually quite possible that syslogger did use a fair amount of memory. Could you show its memory usage, just via ps?

I don't have information from the problematic case because the process had already exited. Right now, for the same database (actually a different instance, because this incident triggered a failover):

    $ ps -opid,rss,vsz,args -p 37627
      PID  RSS   VSZ COMMAND
    37627 2980 14968 /opt/postgresql/9.2.4/bin/postgres -D /manatee/pg/data

I'm not sure what we can infer from that, as this is a different system, and the workload that generates the very large query strings only runs occasionally. Those strings are also not logged unless something's gone wrong.

> > This is a critical point. As far as I can tell, all that's necessary for this deadlock to occur is:
> >
> > - the syslogger is unable to make forward progress (e.g., because it segfaulted)
>
> This specific case I don't care that much about, but ...

I understand if the community isn't interested in fixing this case if other users aren't seeing it much, but surely it's still a bug that this unusual case can result in a deadlock?

-- Dave
Re: [GENERAL] postmaster deadlock while logging after syslogger exited
On Mon, Nov 20, 2017 at 11:12 AM, David Pacheco wrote:
> I understand if the community isn't interested in fixing this case if other users aren't seeing it much, but surely it's still a bug that this unusual case can result in a deadlock?

I've filed bug 14945 to cover this issue: https://www.postgresql.org/message-id/20171204201055.27108.18283%40wrigleys.postgresql.org

While the underlying cause in this case was likely operator error (i.e., too little memory provisioned), the deadlock seems clearly like an issue that could be hit by others and requires operator intervention to restore service. I understand it might still be sufficiently rare, and the fix sufficiently complex, that the community doesn't end up prioritizing it.

Thanks,
Dave
Re: [GENERAL] postmaster deadlock while logging after syslogger exited
On Mon, Dec 4, 2017 at 12:23 PM, Andres Freund wrote:
> Hi,
>
> On 2017-11-20 11:12:08 -0800, David Pacheco wrote:
> > $ ps -opid,rss,vsz,args -p 37627
> >   PID  RSS   VSZ COMMAND
> > 37627 2980 14968 /opt/postgresql/9.2.4/bin/postgres -D /manatee/pg/data
> >
> > I'm not sure what we can infer from that, as this is a different system, and the workload that generates the very large query strings only runs occasionally. Those strings are also not logged unless something's gone wrong.
>
> FWIW, I'd like to see a report of this around the time the issue occurred before doing anything further here.

This failure begins when this process exits, so the best you could get is the memory in use immediately before it exited. I obviously can't get that now for the one instance I saw weeks ago, but maybe PostgreSQL could log information about current memory usage when it's about to exit because of ENOMEM? That way, if anybody hits a similar condition in the future, the data will be available to answer your question.

That said, I think the deadlock itself is pretty well explained by the data we have already.

-- Dave
Re: [GENERAL] postmaster deadlock while logging after syslogger exited
Thanks again for helping out.

On Mon, Dec 4, 2017 at 2:12 PM, Andres Freund wrote:
> On 2017-12-04 13:57:52 -0800, David Pacheco wrote:
> > On Mon, Dec 4, 2017 at 12:23 PM, Andres Freund wrote:
> > > FWIW, I'd like to see a report of this around the time the issue occurred before doing anything further here.
> >
> > This failure begins when this process exits, so the best you could get is the memory in use immediately before it exited. I obviously can't get that now for the one instance I saw weeks ago, but maybe PostgreSQL could log information about current memory usage when it's about to exit because of ENOMEM?
>
> It already does so.

In that case, do you have the information you need in the log that I posted earlier in the thread? (https://gist.githubusercontent.com/davepacheco/c5541bb464532075f2da761dd990a457/raw/2ba242055aca2fb374e9118045a830d08c590e0a/gistfile1.txt)

> What I was wondering about was the memory usage some time before it dies. In particular while the workload with the long query strings is running. ps output would be good, gdb'ing into the process and issuing MemoryContextStats(TopMemoryContext) would be better.

Would it make sense for PostgreSQL to periodically sample the memory used by the current process, keep a small ring buffer of recent samples, and then log all of that when it exits due to ENOMEM? One does not know that one is going to run into this problem before it happens, and it may not happen very often. (I've only seen it once.) The more PostgreSQL can keep the information needed to understand something like this after the fact, the better -- particularly since the overhead required to maintain this information should not be substantial.

> > That way, if anybody hits a similar condition in the future, the data will be available to answer your question.
> >
> > That said, I think the deadlock itself is pretty well explained by the data we have already.
>
> Well, it doesn't really explain the root cause, and thus the extent of the fixes required. If the root cause is the amount of memory used by syslogger, we can remove the deadlock, but the experience is still going to be bad. Obviously better, but still bad.

Fair enough. But we only know about one problem for sure, which is the deadlock. There may be a second problem of the syslogger using too much memory, but I don't think there's any evidence pointing in that direction. Once the whole system is out of memory (and it clearly was, based on the log entries), anything that tried to allocate would fail, and the log reflects that a number of different processes did fail to allocate memory. I'd help investigate this question, but I have no more data about it, and I'm not sure when I will run into this again.

Thanks,
Dave
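
As a sketch of the kind of bookkeeping being suggested here (hypothetical, not existing PostgreSQL code; all names are made up), a process could keep a small ring buffer of recent RSS samples and dump it on the out-of-memory exit path, so the history is available after the fact:

    #include <stddef.h>
    #include <stdio.h>
    #include <time.h>

    #define RSS_SAMPLES 16

    struct rss_sample {
        time_t when;
        size_t rss_bytes;
    };

    static struct rss_sample rss_ring[RSS_SAMPLES];
    static unsigned          rss_next;

    /* Called periodically, e.g. from an existing timer or main loop. */
    static void
    rss_record(size_t rss_bytes)
    {
        rss_ring[rss_next % RSS_SAMPLES].when = time(NULL);
        rss_ring[rss_next % RSS_SAMPLES].rss_bytes = rss_bytes;
        rss_next++;
    }

    /* Called on the out-of-memory exit path, just before terminating. */
    static void
    rss_dump(FILE *log)
    {
        for (unsigned i = 0; i < RSS_SAMPLES; i++) {
            const struct rss_sample *s = &rss_ring[(rss_next + i) % RSS_SAMPLES];

            if (s->rss_bytes != 0)
                fprintf(log, "rss at %ld: %zu bytes\n",
                    (long) s->when, s->rss_bytes);
        }
    }

    int
    main(void)
    {
        /* Illustrative values only; a real caller would sample actual RSS. */
        rss_record(3 * 1024 * 1024);
        rss_record(5 * 1024 * 1024);
        rss_dump(stderr);
        return 0;
    }

Because the buffer is fixed-size and updated in place, the ongoing cost is a couple of stores per sample, which is in keeping with the point above that the overhead should not be substantial.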