Can PostgreSQL create new WAL files instead of reusing old ones?

2018-04-17 Thread David Pacheco
tl;dr: We've found that under many conditions, PostgreSQL's re-use of old WAL
files appears to significantly degrade query latency on ZFS.  The reason is
complicated and I have details below.  Has it been considered to make this
behavior tunable, to cause PostgreSQL to always create new WAL files instead of
re-using old ones?

Context: we're running a large fleet of PostgreSQL shards.  Each shard consists
of a primary, a synchronous standby, and an asynchronous standby using chained
replication.  For this problem, we can consider only the primary and
synchronous standby.

PostgreSQL: 9.6.3
OS: illumos (SmartOS, mixed versions, but all from 2017 or later)
FS: ZFS over mirrored HDDs (not SSDs), using a record size of 8K to match
PostgreSQL's 8K page size.  We have an SSD log device so that synchronous
writes complete very quickly.
WAL files are 16MB each, and we're keeping a very large number of segments.
(There's likely a lot of room for improvement in our WAL tuning here.)
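
For concreteness, the relevant pieces of that setup look roughly like the
following.  (This is a sketch: the pool, dataset, and device names are made up,
and the query just lists the WAL- and checkpoint-related settings we'd be
revisiting.)

    # ZFS: 8K recordsize to match PostgreSQL's page size, plus an SSD slog
    # device for fast synchronous writes.  Names are illustrative only.
    zfs set recordsize=8k tank/postgres
    zpool add tank log c1t2d0

    # The WAL- and checkpoint-related settings mentioned above:
    psql -c "SELECT name, setting, unit FROM pg_settings
             WHERE name IN ('wal_keep_segments', 'min_wal_size',
                            'max_wal_size', 'checkpoint_timeout');"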

Since we're using an 8K record size, when PostgreSQL writes small (or
non-aligned) records to the WAL files, ZFS has to read the old contents in
order to write the new 8K record (read-modify-write).  If that's in cache,
that's not a big deal.  But when PostgreSQL decides to reuse an old WAL file
whose contents have been evicted from the cache (because they haven't been used
in hours), this turns what should be a workload bottlenecked by synchronous
write performance (that's well-optimized with our SSD log device) into a random
read workload (that's much more expensive for any system).
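
A rough way to see this from outside the database (pool and dataset names
assumed) is to watch pool I/O while the workload is nominally write-only:

    # During WAL reuse of evicted segments, a nominally write-only workload
    # shows a burst of random reads against the HDD vdevs.  "tank" and
    # "tank/postgres" are assumed names.
    zpool iostat -v tank 1

    # Confirm the record size and caching policy on the dataset holding the WAL:
    zfs get recordsize,primarycache tank/postgres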

What's significantly worse is that we saw this on synchronous standbys.  When
that happened, the WAL receiver was blocked on a random read from disk, and
since it's single-threaded, all write queries on the primary stopped until the
random read finished.  This is particularly bad for us when the synchronous
standby is doing other I/O (e.g., for an autovacuum or a database backup) that
causes disk reads to take hundreds of milliseconds.

Reusing old WAL files seems like an optimization intended for filesystems that
allocate disk blocks up front.  With copy-on-write, it doesn't seem to make
much sense.  If instead of reusing an old WAL file, PostgreSQL just created a
new one, there would be no random reads required to complete these operations,
and we believe we'd completely eliminate our latency outliers.
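
To be concrete about what we're asking for: a setting that tells PostgreSQL to
always create a fresh segment rather than renaming an old one.  Nothing like it
exists in 9.6, so the GUC name below is purely hypothetical:

    # Hypothetical sketch only -- 9.6 has no such setting; this is the shape
    # of the knob being proposed, not an existing GUC.
    echo "wal_recycle = off" >> "$PGDATA/postgresql.conf"
    pg_ctl reload -D "$PGDATA"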

Thanks,
Dave


Re: Can PostgreSQL create new WAL files instead of reusing old ones?

2018-04-17 Thread David Pacheco
On Tue, Apr 17, 2018 at 11:57 AM, Tom Lane  wrote:

> Alvaro Herrera writes:
> > David Pacheco wrote:
> >> tl;dr: We've found that under many conditions, PostgreSQL's re-use of
> >> old WAL files appears to significantly degrade query latency on ZFS.
> >> The reason is complicated and I have details below.  Has it been
> >> considered to make this behavior tunable, to cause PostgreSQL to always
> >> create new WAL files instead of re-using old ones?
>
> > I don't think this has ever been proposed, because there was no use case
> > for it.  Maybe you want to work on a patch for it?
>
> I think possibly the OP doesn't understand why it's designed that way.
> The point is not really to "recycle old WAL files", it's to avoid having
> disk space allocation occur during the critical section where we must
> PANIC on failure.  Now, of course, that doesn't really work if the
> filesystem is COW underneath, because it's allocating fresh disk space
> anyway even though semantically we're overwriting existing data.
> But what I'd like to see is a fix that deals with that somehow, rather
> than continue to accept the possibility of ENOSPC occurring inside WAL
> writes on these file systems.  I have no idea what such a fix would
> look like :-(



I think I do understand, but as you've observed, recycling WAL files to
avoid allocation relies on the implementation details of the filesystem --
details that I'd expect not to be true of any copy-on-write filesystem.  On
such systems, there may not be a way to avoid ENOSPC in special critical
sections.  (And that's not necessarily such a big deal -- to paraphrase a
colleague, ensuring that the system doesn't run out of space does not seem
like a particularly surprising or heavy burden for the operator.  It's
great that PostgreSQL can survive this event better on some systems, but
the associated tradeoffs may not be worthwhile for everybody.)  And given
that, it seems worthwhile to provide the operator an option where they take
on the risk that the database might crash if it runs out of space (assuming
the result isn't data corruption) in exchange for a potentially tremendous
improvement in tail latency and overall throughput.

To quantify this: in a recent incident, transaction latency on the primary
was degraded about 2-3x (from a p90 of about 45ms to upwards of 120ms, with
outliers exceeding 1s).  Over 95% of the outliers above 1s spent over 90%
of their time blocked on synchronous replication (based on tracing with
DTrace).  On the synchronous standby, almost 10% of the WAL receiver's wall
clock time was spent blocked on disk reads in this read-modify-write path.
The rest of the time was essentially idle -- there was plenty of headroom
in other dimensions (CPU, synchronous write performance).
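
For reference, the tracing was along these lines (a simplified sketch rather
than our exact scripts): a histogram of the time postgres processes spend
blocked in read(2) on the standby.

    # Simplified sketch: how long do postgres processes spend blocked in read(2)?
    dtrace -n '
        syscall::read:entry /execname == "postgres"/ { self->ts = timestamp; }
        syscall::read:return /self->ts/ {
            @["read(2) latency (ns)"] = quantize(timestamp - self->ts);
            self->ts = 0;
        }'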

Thanks,
Dave


Re: Can PostgreSQL create new WAL files instead of reusing old ones?

2018-06-21 Thread David Pacheco
On Wed, Jun 20, 2018 at 10:35 AM, Jerry Jelinek wrote:

> As Dave described in his original email on this topic, we'd like to avoid
> recycling WAL files since that can cause performance issues when we have a
> read-modify-write on a file that has dropped out of the cache.
>
> I have implemented a small change to allow WAL recycling to be disabled.
> It is visible at:
> https://cr.joyent.us/#/c/4263/
>
> I'd appreciate getting any feedback on this.
>
> Thanks,
> Jerry
>
>

For reference, there's more context in this thread from several months ago:
https://www.postgresql.org/message-id/flat/CACukRjO7DJvub8e2AijOayj8BfKK3XXBTwu3KKARiTr67M3E3w%40mail.gmail.com#cacukrjo7djvub8e2aijoayj8bfkk3xxbtwu3kkaritr67m3...@mail.gmail.com

I'll repeat the relevant summary here:

> tl;dr: We've found that under many conditions, PostgreSQL's re-use of old
> WAL files appears to significantly degrade query latency on ZFS.  The
> reason is complicated and I have details below.  Has it been considered to
> make this behavior tunable, to cause PostgreSQL to always create new WAL
> files instead of re-using old ones?


Thanks,
Dave


Autovacuum degrades all other operations by keeping all buffers dirty?

2018-08-31 Thread David Pacheco
Hello,

We've been struggling with some major performance issues related to
autovacuum.  (I know this is a common problem.)  For a while, we believed this
was primarily related to I/O contention, but recent observations from our
monitoring make me wonder if there's a deeper issue here, so I'm looking for
some better understanding.

From reading the 9.6.3 source, it looks like the autovacuum process itself is
single-threaded, and it reads pages essentially linearly from the relation
(possibly skipping some).  When the autovacuum process needs to modify a page,
it doesn't write it directly, but rather marks the buffer dirty.  The page will
be written later, either by the checkpointer (according to its configuration,
in terms of time and WAL), by the bgwriter (according to its configuration, in
terms of write count and sleep time), or else by some other backend process
that requires a free buffer (if the count of non-dirty buffers reaches zero).
Is this accurate?
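
(The breakdown I describe below is visible in pg_stat_bgwriter; sampling deltas
of a query like this one, a sketch rather than our exact monitoring, shows who
is writing buffers:)

    # Who is writing dirty buffers?  Sample periodically and diff the counters:
    #   buffers_checkpoint -- written by the checkpointer
    #   buffers_clean      -- written by the bgwriter
    #   buffers_backend    -- written by ordinary backends
    #   buffers_alloc      -- total buffer allocations
    psql -c "SELECT buffers_checkpoint, buffers_clean, maxwritten_clean,
                    buffers_backend, buffers_alloc
             FROM pg_stat_bgwriter;"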

In our case, we found that when autovacuum runs, the number of dirty buffers
written by regular backends shoots from a handful to as much as 300 buffers per
second.  (We have 200 backends on most databases.)  More specifically: prior to
recent autovacuums starting, databases did under 1,000 buffer allocations per
second.  Most of those were done by the checkpointer.  Individual backends did
just a handful.  After autovacuum started, buffer allocations rose to between
2,000 and 8,000 per second.  The checkpointer handled many of these, but so did
individual backends (see above).  The bgwriter peaked around 25 buffer writes
per second.  So it seems like the spike in buffers written by normal backends
could explain the significant degradation that we see in average latency and
overall throughput (which can be as much as 50%).

It looks to me like the autovacuum process is effectively generating work (in
the form of async writes) that's being distributed implicitly to the various
backend processes, creating latency for any other query that happens to require
a buffer (including read-only queries).  Maybe we can improve this by tuning
the bgwriter.  But if that's single-threaded, presumably there's an upper bound
to how many buffer writes it can handle?  Is there anything to guarantee that
the bgwriter will keep up with the work that's being generated?  That is, for
some workloads, wouldn't it be possible that autovacuum could always generate
work faster than the bgwriter can do it, and you'd always have some pretty
major degradation to all other queries?
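
(For reference, the bgwriter knobs I have in mind are bgwriter_delay,
bgwriter_lru_maxpages, and bgwriter_lru_multiplier.  Something like the
following shows the current values; the commented values are only an example of
the direction we'd push them, not tested recommendations.)

    # Current bgwriter settings:
    psql -c "SELECT name, setting, unit FROM pg_settings
             WHERE name IN ('bgwriter_delay', 'bgwriter_lru_maxpages',
                            'bgwriter_lru_multiplier');"
    # Example (untested) direction for making the bgwriter more aggressive:
    #   bgwriter_delay = 10ms
    #   bgwriter_lru_maxpages = 1000
    #   bgwriter_lru_multiplier = 4.0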

I've drawn a lot of inferences here, and I'm sure there are some mistakes.  I'd
appreciate any clarification, correction, or confirmation!

Thanks in advance,
Dave


Re: Autovacuum degrades all other operations by keeping all buffers dirty?

2018-09-05 Thread David Pacheco
On Fri, Aug 31, 2018 at 3:50 PM, Andres Freund  wrote:

> On 2018-08-31 19:31:47 -0300, Alvaro Herrera wrote:
> > On 2018-Aug-31, David Pacheco wrote:
> >
> > > From reading the 9.6.3 source, it looks like the autovacuum process
> > > itself is single-threaded, and it reads pages essentially linearly
> > > from the relation (possibly skipping some).  When the autovacuum
> > > process needs to modify a page, it doesn't write it directly, but
> > > rather marks the buffer dirty.  The page will be written later,
> >
> > Unless there's some bug, there is a BufferAccessStrategy that only lets
> > a few dozen buffers go unwritten before the autovac worker process
> > itself is forced to write some.
>
> I've not re-checked, but I'm not sure that's true if the buffer is
> already in s_b, which it'll be for many workloads.
>


Does that mean this analysis from above is accurate?

> It looks to me like the autovacuum process is effectively generating work
> (in the form of async writes) that's being distributed implicitly to the
> various backend processes, creating latency for any other query that
> happens to require a buffer (including read-only queries).


Thanks,
Dave


Re: [GENERAL] postmaster deadlock while logging after syslogger exited

2017-11-20 Thread David Pacheco
Responding to several points below.

Tom Lane wrote:
> David's report isn't too clear: did the syslogger process actually run
> out of memory and exit of its own volition after an ENOMEM, or did it get
> killed by the dreaded OOM killer?  In either case, it's unclear whether
> it was really using an excessive amount of memory.  We have not heard
> reports suggesting a memory leak in the syslogger, but maybe there is
> one under unusual circumstances?

I'm running on illumos, where there is no OOM killer.

I expect what happened is that the syslogger process attempted to allocate
memory, failed because the system was low, and explicitly exited.  That's
consistent with an exited process, no core file generated, and the "FATAL"
"out of memory" entries in the log I posted.

Of course, that doesn't mean the syslogger was using much memory.  It means
only that it attempted to allocate memory in a context where it could not
handle failing.


Andres Freund wrote:
> I'm somewhat inclined to say that one has to live with this if the
> system is so resource constrainted that processes barely using memory
> get killed.

Again, there's no OOM killer here, so the process has the opportunity to
respond differently.  I can understand the database not functioning well when
the system is out of memory, but this half-functioning, half-deadlocked state
is quite disruptive.  By comparison, while not ideal, if the syslogger had
raised SIGABRT instead, that would have been significantly better for my use
case.  Our process manager would have terminated all of the processes
associated with the database and restarted it.


Tom Lane wrote:
> Hm, so that's another angle David didn't report on: is it possible that
> his workload could have resulted in a very large volume of incomplete
> in-progress log messages?

Yes.  I mentioned in my Nov 6 mail that large log messages over a short period
appear to have been a major contributing factor:

> Here are all of the entries in the PostgreSQL log from 23:19:12 until the
> top of the next hour:
> https://gist.githubusercontent.com/davepacheco/c5541bb464532075f2da761dd990a457/raw/2ba242055aca2fb374e9118045a830d08c590e0a/gistfile1.txt
...
> There are also many log entries for some very long SQL queries.  I'm sure
> that contributed to this problem by filling up the pipe.  I was able to
> extract the contents of the pipe while the system was hung, and it was more
> of these giant query strings.



Andres Freund wrote:
> Note that there's plenty of cases where you could run into this even
> without being unable to fork new processes. You'd e.g. could also run
> into this while logging the exit of some other subprocess or such,
> there's enough ereports in postmaster.

This is a critical point.  As far as I can tell, all that's necessary for this
deadlock to occur is:

- the syslogger is unable to make forward progress (e.g., because it
  segfaulted)
- enough other processes write to the pipe that it fills up before the
  postmaster can restart the syslogger
- the postmaster attempts to write to the log as part of any of its signal
  handling operations

It seems to me that the same thing could happen if there were a bug in the
syslogger that resulted in a segfault around the same time that the postmaster
attempted to start an autovacuum worker, for example (if there was also
reasonably high log traffic).
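
When this happens, the hung state is visible from outside the database.
Roughly (illumos tooling; matching processes by argument string is
approximate):

    # No syslogger process running:
    pgrep -lf 'postgres.*logger'
    # Postmaster blocked in write(2) on the logging pipe:
    pstack $(pgrep -f 'bin/postgres -D')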


Thanks,
Dave


Re: [GENERAL] postmaster deadlock while logging after syslogger exited

2017-11-20 Thread David Pacheco
Thanks again for looking at this issue.

On Mon, Nov 20, 2017 at 10:34 AM, Andres Freund  wrote:

> Hi,
>
> On 2017-11-20 10:13:57 -0800, David Pacheco wrote:
> > I expect what happened is that the syslogger process attempted to
> > allocate memory, failed because the system was low, and explicitly
> > exited.  That's consistent with an exited process, no core file
> > generated, and the "FATAL" "out of memory" entries in the log I posted.
>
> Well unfortunately we're presumably not going to see output from
> syslogger's death itself, chicken and egg etc.
>


If that's true, I think that's a separate bug.  If a process receives
ENOMEM from an attempt to allocate memory, it can still write out a log
message to the already-open file descriptor indicating that it's exiting
because of that failure.  For all I know, that's what happened here, as
there are FATAL "out of memory" entries in the log.  (I'd be able to tell
for sure if we had configured logging to include pids in the log messages.)
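
(The logging configuration I mean is just including the pid in
log_line_prefix, e.g.:)

    # Include a timestamp and the pid in every log line so messages like the
    # "out of memory" FATALs can be attributed to a specific process:
    echo "log_line_prefix = '%m [%p] '" >> "$PGDATA/postgresql.conf"
    pg_ctl reload -D "$PGDATA"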


> Tom Lane wrote:
> > > Hm, so that's another angle David didn't report on: is it possible that
> > > his workload could have resulted in a very large volume of incomplete
> > > in-progress log messages?
> >
> > Yes.  I mentioned in my Nov 6 mail that large log messages over a
> > short period appear to have been a major contributing factor:
>
> So it's actually quite possible that syslogger did use a fair amount of
> memory. Could you show its memory usage, just via ps?
>


I don't have information from the problematic case because the process had
already exited.  Right now, for the same database (actually a different
instance, because this incident triggered a failover):

$ ps -opid,rss,vsz,args -p 37627
  PID  RSS  VSZ COMMAND
37627 2980 14968 /opt/postgresql/9.2.4/bin/postgres -D /manatee/pg/data

I'm not sure what we can infer from that, as this is a different system,
and the workload that generates the very large query strings only runs
occasionally.  Those strings are also not logged unless something's gone
wrong.


> > This is a critical point.  As far as I can tell, all that's necessary
> > for this deadlock to occur is:
> >
> > - the syslogger is unable to make forward progress (e.g., because it
> >   segfaulted)
>
> This specific case I don't care that much about, but ...



I understand if the community isn't interested in fixing this case if other
users aren't seeing it much, but surely it's still a bug that this unusual
case can result in a deadlock?

-- Dave


Re: [GENERAL] postmaster deadlock while logging after syslogger exited

2017-12-04 Thread David Pacheco
On Mon, Nov 20, 2017 at 11:12 AM, David Pacheco  wrote:

> I understand if the community isn't interested in fixing this case if
> other users aren't seeing it much, but surely it's still a bug that this
> unusual case can result in a deadlock?
>


I've filed bug 14945 to cover this issue:
https://www.postgresql.org/message-id/20171204201055.27108.18283%40wrigleys.postgresql.org

While the underlying cause in this case was likely operator error (i.e.,
too little memory provisioned), the deadlock seems clearly like an issue
that could be hit by others and requires operator intervention to restore
service.  I understand it might still be sufficiently rare, and the fix
sufficiently complex, that the community doesn't end up prioritizing it.

Thanks,
Dave


Re: [GENERAL] postmaster deadlock while logging after syslogger exited

2017-12-04 Thread David Pacheco
On Mon, Dec 4, 2017 at 12:23 PM, Andres Freund  wrote:

> Hi,
>
> On 2017-11-20 11:12:08 -0800, David Pacheco wrote:
> > $ ps -opid,rss,vsz,args -p 37627
> >   PID  RSS  VSZ COMMAND
> > 37627 2980 14968 /opt/postgresql/9.2.4/bin/postgres -D /manatee/pg/data
> >
> > I'm not sure what we can infer from that, as this is a different system,
> > and the workload that generates the very large query strings only runs
> > occasionally.  Those strings are also not logged unless something's gone
> > wrong.
>
> FWIW, I'd like to see a report of this around the time the issue
> occurred before doing anything further here.
>


This failure begins when this process exits, so the best you could get is
memory in use immediately before it exited.  I obviously can't get that now
for the one instance I saw weeks ago, but maybe PostgreSQL could log
information about current memory usage when it's about to exit because of
ENOMEM?  That way if anybody hits a similar condition in the future, the
data will be available to answer your question.

That said, I think the deadlock itself is pretty well explained by the data
we have already.

-- Dave


Re: [GENERAL] postmaster deadlock while logging after syslogger exited

2017-12-04 Thread David Pacheco
Thanks again for helping out.


On Mon, Dec 4, 2017 at 2:12 PM, Andres Freund  wrote:

> On 2017-12-04 13:57:52 -0800, David Pacheco wrote:
> > On Mon, Dec 4, 2017 at 12:23 PM, Andres Freund wrote:
> > > FWIW, I'd like to see a report of this around the time the issue
> > > occurred before doing anything further here.
> >
> > This failure begins when this process exits, so the best you could get
> > is memory in use immediately before it exited.  I obviously can't get
> > that now for the one instance I saw weeks ago, but maybe PostgreSQL
> > could log information about current memory usage when it's about to
> > exit because of ENOMEM?
>
> It already does so.
>


In that case, do you have the information you need in the log that I posted
earlier in the thread?
(https://gist.githubusercontent.com/davepacheco/c5541bb464532075f2da761dd990a457/raw/2ba242055aca2fb374e9118045a830d08c590e0a/gistfile1.txt)


What I was wondering about was the memory usage some time before it
> dies. In particular while the workload with the long query strings is
> running. ps output would be good, gdb'ing into the process and issuing
> MemoryContextStats(TopMemoryContext) would be better.
>


Would it make sense for PostgreSQL to periodically sample the memory used
by the current process, keep a small ringbuffer of recent samples, and then
log all of that when it exits due to ENOMEM?

One does not know that one is going to run into this problem before it
happens, and it may not happen very often.  (I've only seen it once.)  The
more PostgreSQL can keep the information needed to understand something
like this after the fact, the better -- particularly since the overhead
required to maintain this information should not be that substantial.
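
In the meantime, something along these lines run outside the database
approximates the idea (a crude sketch; the process-title match and file paths
are arbitrary):

    # Crude external approximation of that ring buffer: sample the syslogger's
    # memory every few seconds and keep only the most recent samples.
    while sleep 5; do
        pid=$(pgrep -f 'postgres.*logger')
        if [ -n "$pid" ]; then
            echo "$(date '+%Y-%m-%dT%H:%M:%S') $(ps -o rss= -o vsz= -p "$pid")" \
                >> /var/tmp/syslogger-mem.log
            tail -n 1000 /var/tmp/syslogger-mem.log > /var/tmp/syslogger-mem.tmp &&
                mv /var/tmp/syslogger-mem.tmp /var/tmp/syslogger-mem.log
        fi
    done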


> > That way if anybody hits a similar condition in the future, the data
> > will be available to answer your question.
> >
> > That said, I think the deadlock itself is pretty well explained by the
> > data we have already.
>
> Well, it doesn't really explain the root cause, and thus the extent of
> the fixes required. If the root cause is the amount of memory used by
> syslogger, we can remove the deadlock, but the experience is still going
> to be bad. Obviously better, but still bad.
>


Fair enough.  But we only know about one problem for sure, which is the
deadlock.  There may be a second problem of the syslogger using too much
memory, but I don't think there's any evidence to point in that direction.
Once the whole system is out of memory (and it clearly was, based on the
log entries), anything that tried to allocate would fail, and the log
reflects that a number of different processes did fail to allocate memory.
I'd help investigate this question, but I have no more data about it, and
I'm not sure when I will run into this again.

Thanks,
Dave