Josh Berkus wrote:
However, this leaves aside Greg's point about snapshot age and successive queries; does anyone dispute his analysis? Simon?
There's already a note on the Hot Standby TODO about unexpectedly bad max_standby_delay behavior being possible on an idle system, with no suggested resolution for it besides better SR integration. The issue Greg Stark has noted is another variation on that theme. It's already on my list of theorized pathological but as yet undemonstrated concerns that Simon and I identified, the list I'm working through by creating test cases to prove or disprove each one. I'm past "it's possible..." talk at this point though, so as not to spook anyone unnecessarily, and am only raising things I can show concrete examples of in action. White box testing at some point does require pausing one's investigation of what's in the box and getting on with the actual testing instead.
The only real spot where my opinion diverges here is that I have yet to find any situation where 'max_standby_delay=-1' makes any sense to me. When I try running my test cases with that setting, the whole system just reacts far too strangely. My first patch here is probably going to add more visibility into the situation when queries are blocking replication forever, because I think the times I find myself at "why is the system hung right now?" are when that happens, and it's not obvious to an admin what's going on.
Also, the idea that a long-running query on the standby could cause an unbounded delay in replication is so foreign to my sensibilities that I never include it in the list of useful solutions to the problems I'm worried about. The option is there, and I'm not disputing that it makes sense for some people, since there seems to be some demand for it; I just can't see how it fits into any of the use cases I'm concerned about.
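To put the bounded versus unbounded cases in concrete terms, here is a toy model in Python. It is only a sketch under loud assumptions: the function name replay_wait and its parameters are made up for illustration, it considers a single conflicting query, and it deliberately ignores how the standby actually measures the delay, which is where the idle-system concern comes from. It is not how the server implements anything.

```python
def replay_wait(query_runtime_s, max_standby_delay_s):
    """Toy model (hypothetical, not PostgreSQL code).

    Returns (seconds WAL replay waits, whether the query gets cancelled)
    for one conflicting standby query that would run for query_runtime_s.
    A negative max_standby_delay_s stands in for the "-1, wait forever" setting.
    """
    if max_standby_delay_s < 0:
        # Wait forever: replay is held up for as long as the query runs.
        return query_runtime_s, False
    if query_runtime_s <= max_standby_delay_s:
        # The query finishes inside the allowed delay; nothing is cancelled.
        return query_runtime_s, False
    # The delay runs out first: the query is cancelled and replay resumes.
    return max_standby_delay_s, True


if __name__ == "__main__":
    for runtime, delay in [(10, 30), (600, 30), (600, -1)]:
        print(runtime, delay, replay_wait(runtime, delay))
```

With a non-negative setting the wait in this model never exceeds the setting; with -1 it is whatever the query takes, so a query that never ends means replay never resumes.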
I haven't said anything about query retry mainly because I can't imagine any way it's possible to build it in time for this release, so whether it's eventually feasible or not doesn't enter into what I'm worried about right now. In any case, I would prioritize that behind work on preventing the most common situations that cause cancellations in the first place, until those are handled so well that retry is the most effective improvement left to consider.
-- Greg Smith 2ndQuadrant US Baltimore, MD PostgreSQL Training, Services and Support g...@2ndquadrant.com www.2ndQuadrant.us