Re: [BUGS] BUG #5837: PQstatus() fails to report lost connection

Murray S. Kucherawy Mon, 24 Jan 2011 01:16:07 -0800

> -----Original Message-----
> From: Robert Haas [mailto:[email protected]]
> Sent: Sunday, January 23, 2011 8:59 PM
> To: Murray S. Kucherawy
> Cc: Tom Lane; [email protected]
> Subject: Re: [BUGS] BUG #5837: PQstatus() fails to report lost connection
> 
> > As for the reply above, I disagree.  PQstatus(), as documented,
> > doesn't say anything about certain conditions in which it won't report
> > that the connection is dead, when it actually is, once the connection
> > was already established and working.
> 
> I understand that gut feeling, but I think if you think about it a
> little, you may realize that at some level it's an unreasonable
> expectation.  For example, suppose you were to connect to the server
> (successfully), unplug the network cable, and then call PQstatus().
> There is literally no way for PQstatus() to know whether that
> connection is still there.  It's just returning some internal flag out
> of the object.


True, of course.  But that's not the case here; PQsendRequest() was used to try 
to contact the server to submit some SQL operation after the server had been 
restarted.  Enough of that operation failed so as to make PQresultStatus() 
return PQRES_FATAL_ERROR.  There was enough I/O for libpq to figure out that 
something has gone wrong, in fact enough to know that the cause was 
administrative termination of the server for some reason (since the error 
string returned is "FATAL: terminating connection due to administrator 
command").  Basically it appears, then, that enough communication (or attempts 
at it) took place to observe that the connection not usable anymore and mark it 
as such, and even why.

Again, I'm looking at this as a consumer of the API; I don't know, and to some 
degree shouldn't care, what's going on inside.  The results simply look 
inconsistent.

I don't expect libpq to do any kind of keep-alive work, but once something has 
reported PGRES_FATAL_ERROR, I would expect the result returned by PQstatus() to 
be accurate.  In this case, it isn't.  It gives the impression that an 
incomplete state change occurred or something like that.

> Now, in this case, things are a bit less clear, because it's not as if
> the remote side has given us no indication that the connection is
> about to get closed.  This doesn't strike me as awesome protocol
> design, but 14 years on we're probably stuck with it. I think if you
> want to have some special handling when an administrative shutdown
> happens on the server, you should use PQresultErrorField() to check
> PG_DIAG_SQLSTATE; or if you want FATAL and PANIC conditions to be
> handled specially you can check PG_DIAG_SEVERITY.

The model in OpenDBX appears to be to query and then iterate getting results 
until there are none; if there's ever an error, see if it's a kind of error 
that requires a reconnection attempt before continuing.  Without assuming 
PGRES_FATAL_ERROR always means a reconnect is required, it seems like 
PQstatus() is the right test for the latter of those two "ifs", and that's what 
they're doing.  It doesn't really matter if the disconnect was a result of 
administrative action or not; the caller just needs to know if it has to try to 
reconnect before continuing.

> The description of PGRES_FATAL_ERROR in the documentation does pretty
> much suck.  I believe you'll get that if the backend returns either an
> ERROR (in which case you need to retry the transaction) or a FATAL (in
> which case you need to reset the connection), but that's not at all
> clear either from the documentation or the naming of the constant
> (which, alas, is hard to change at this point for
> backward-compatibility reasons).

Well my other suggestion would be to assume PGRES_FATAL_ERROR always means the 
connection needs to be reset.  But this blows that idea away; this would cause 
a connection reset that wouldn't actually solve the problem when it's an ERROR 
as you described.

> The description of PQstatus() looks correct to me.  It says that "a
> communications failure might result in the status changing to
> CONNECTION_BAD prematurely".  In the scenario you describe, no
> communications failure has occurred.  The server sent back an error
> message, and closed the connection (though libpq hasn't noticed yet)
> but there's no communications error anywhere until the client tries to
> send a query over a connection that doesn't exist any more.  Maybe we
> could flesh that description out a bit to make it more clear that this
> is really only going to tell you if TCP/IP has explicitly blown up in
> your face, and not any other reason why things might not be working,
> although I think there are hints of that there already.

Given what you say here, this seems to be an exception for which there is 
currently no detection mechanism short of looking for that one specific error 
string indicating it was administrative action causing the error.  I think your 
suggested changes would certainly help, as well as some generalized way to know 
that the connection handle is now unusable, either because of administrative 
server action or because the connection became unreliable for some other 
reason.  I'd be fine with having to do PQstatus() || PQadminStatus() too, if 
such a thing were to appear.  Better documentation of PGRES_FATAL_ERROR would 
help as well.

-MSK


-- 
Sent via pgsql-bugs mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] BUG #5837: PQstatus() fails to report lost connection

Reply via email to