The job Simon asked me to start on last week was to figure out which of the listed HS "Serious Issues" should be promoted to must-fix items, based on how easy they are to encounter. There are a number of HS tunables that interact with one another, and depending on your priorities there are a few ways you can try to optimize the configuration for what I expect to be common use cases for this feature. I've written a blog entry at http://blog.2ndquadrant.com/en/2010/02/tradeoffs-in-hot-standby-deplo.html that tries to explain all that background clearly, and to relate the implementation details to how I expect DBAs will perceive them. That was a bit much to also cover here, and it has a broader potential audience than just this list.
Attached is a tar file with some test-case demo scripts that demonstrate what I consider the worst of the problems here. A README in there outlines how to set the problem demo up (presuming you've already gotten an HS pair going). What it does is execute the following sequence continuously on the master:
UPDATE pgbench_tellers SET tbalance = tbalance + <delta> WHERE tid = <tid>; (several times)
VACUUM pgbench_tellers;
Meanwhile, on the standby, the following long query runs on a few sessions at once, again looping constantly:
SELECT sum(abalance) FROM pgbench_accounts;
It took a bit of testing to get the database scale and iteration times to where the issue was easy to encounter on my system; I hope it shows up easily enough for others with the values used. (I have a similar work-in-progress demo that tries to trigger the b-tree deletion problem too; I'll follow up on that once the storm of messages about this topic dies down, as I think this one is a prerequisite for it anyway.)
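For anyone who doesn't want to unpack the tarball, the core of the driver loops looks roughly like the sketch below. The host names, iteration counts, and update values here are placeholders rather than the exact settings the attached scripts use; it assumes a pgbench-initialized database on the master and a standby already replaying from it.

#!/bin/bash
# Rough sketch of the demo workload; the attached scripts handle setup,
# scale, and timing more carefully.  Connection settings and values below
# are placeholders only.
MASTER="-h master.example.com -d pgbench"
STANDBY="-h standby.example.com -d pgbench"

# Master side: a burst of teller updates followed by an immediate VACUUM,
# so the dead-row cleanup shows up in the WAL stream quickly.
(
  while true; do
    for i in $(seq 1 10); do
      psql $MASTER -c "UPDATE pgbench_tellers SET tbalance = tbalance + 1 WHERE tid = 1;"
    done
    psql $MASTER -c "VACUUM pgbench_tellers;"
  done
) &

# Standby side: the long-running sequential scan, looping constantly.
# The demo starts a few of these sessions at once.
while true; do
  psql $STANDBY -c "SELECT sum(abalance) FROM pgbench_accounts;"
done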
I'm not sure what you might be expecting from the above combination, but what actually happens is that many of the SELECT statements on the table *that isn't even being updated* are canceled. You see this in the logs:
LOG: restored log file "0000000100000000000000A5" from archive
ERROR: canceling statement due to conflict with recovery
DETAIL: User query might have needed to see row versions that must be removed.
STATEMENT: SELECT sum(abalance) FROM pgbench_accounts;
Basically, every time WAL arrives that removes a tuple the SELECT's snapshot might still need to see (because the dead rows left behind by the updates have been vacuumed away), the query is canceled. Note that recovery has no idea which tables the query is actually reading; as far as I can tell, the conflict check only compares the query's snapshot against the cutoff xid in the cleanup record, which is why a scan of pgbench_accounts dies over cleanup in pgbench_tellers. This happens constantly the way I've set this up, and I don't feel this is a contrived demo. Having a long-running query on the standby while things get updated and then periodically autovacuumed on the primary is going to be extremely common in the sorts of production systems that I expect to want HS the most.
Now, as explained in the blog entry and in the documentation, there are all sorts of ways you can work around this issue by tweaking parameters or doing fun tricks with dblink. You can prioritize any two of: keeping the standby current, letting long-running queries execute on the standby, and keeping xid advances on the master moving forward as fast as possible. But you can't get all three at once. The choices available are really about the best you can do given a system that's basically the old warm-standby approach with just Hot Standby added to the mix. Sure, you might make the conflict resolution a bit smarter or make the UI for setting the parameters friendlier, and there's already been plenty of argument and patching over all of that. I don't want to belittle that work, because it's been important for making HS a useful standalone feature, but I feel like it's all triage rather than a search for the most robust fix possible.
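To make the tradeoff concrete, here's roughly what the two main knobs look like as I understand the tunables in the current tree; the names and values below are illustrative only, and every setting just moves the pain somewhere else rather than eliminating it:

# On the standby (postgresql.conf): how long recovery waits on conflicting
# queries before canceling them.  -1 waits forever, so the standby falls
# behind during long queries; 0 cancels immediately, so the standby stays
# current but queries die; anything in between is a compromise.
max_standby_delay = 30s                # illustrative value only

# On the master (postgresql.conf): defer cleanup of dead rows by this many
# transactions, giving standby queries a head start at the cost of bloat,
# with a value that's hard to map onto wall-clock time for a given workload.
vacuum_defer_cleanup_age = 10000       # illustrative value only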
If you're running a system that is also using Streaming Replication, a much better approach is possible. This idea has been floating around for a while and I am not taking credit for inventing it (too busy tonight to dig into the archives to figure out exactly when it first popped up and who deserves credit for it). I'm just pointing out that now is the time when it's actually possible to implement. The HS TODO already includes the following action item, aimed at resolving another serious issue you can run into (one that itself would be great to eliminate):
"Requires keep-alives with timestamps to be added to sync rep feature"If those keep-alives flowed in both directions, and included both timestamps *and* xid visibility information, the master could easily be configured to hold open xid snapshots needed for long running queries on the standby when that was necessary. I might be missing an implementation detail here, but from a high level it seems like you could make the walreceiver on the master publish the information about where the standby has advanced to as a bit of ProcArray xmin data. Then the master could only advance past where the standby says it cannot need visibility behind anymore.
This is a much more elegant solution than any of the hacks available so far. It would turn Hot Standby + Streaming Replication into a system that steps out of the way of the worst technical limitations of HS alone. The master would easily and automatically avoid advancing past the point the queries running on the standby still need visibility back to, essentially the same way cleanup is held back by a long-running query on the primary--except with the actual query work offloaded to the standby, which was the idea all along.
I don't know how difficult the keepalive feature was expected to be, and there are certainly plenty of potential landmines in this whole xid-export idea. One concern is how to handle situations where the standby goes away for a while, such as during a network outage, without blocking the master from ever cleaning up dead tuples. I wouldn't expect that to be too serious a blocker, given that if the standby isn't talking, it probably isn't running queries you need to worry about canceling either. I'm not sure where else this can fall down, and unfortunately I don't know nearly enough about the SR code to implement this feature myself. (I think Simon is in a similar position--it's just not what we've been staring at for the last few months.)
But I do know that the current Hot Standby implementation is going to be frustrating for people to configure correctly. If it's possible to make most of that go away through some final integration between it and Streaming Replication, integration that simply wasn't practical to accomplish until now, I think it's worth considering how to make that happen before the final 9.0 release.
I really hope this discussion can stay focused on whether and how it's possible to improve this area, with the goal of delivering a product everyone can be proud of, with the full feature set that makes this next release a killer one. The features that have all managed to get into this release are fantastic, everyone who contributed should be proud of that progress, and it's encouraging that the alpha4 date was hit. It would be easy to descend into finger-pointing over why exactly this particular problem is only getting more visibility now, or into schedule-oriented commentary suggesting it must be ignored because it's too late to do anything about it. I hope everyone appreciates that wandering that way will not help make PostgreSQL 9.0 a better release. This issue is so easy to encounter, and looks so bad when it happens, that I feel it could easily lead to an embarrassing situation for the community if something isn't done about it before release.
--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
g...@2ndquadrant.com   www.2ndQuadrant.us
hs-demo.tar.gz