I believe I have found a bug in libpq COPY handling with non-blocking SSL connections. The bug manifests itself by dropping the connection in PGputCopyData() with "server closed the connection unexpectedly" message. The connection drop only occurs with nonblocking connections - blocking connections work as expected.
I'll skip a lot of debugging details, but what I found out was that the connection drop was caused by a SSL "bad write retry" from SSL_write(). Which in turn appears to be caused by bad data passed to SSL_write() retry. So, what happened was that at some point SSL_write() returned SSL_ERROR_WANT_WRITE meaning that the write needs to be retried. Now, according to OpenSSL manual the retry SSL_write needs to be passed the same arguments as the previous call: http://www.openssl.org/docs/ssl/SSL_write.html#WARNING However the next SSL_write was passed another buffer with a different length and it failed with SSL_ERROR_SSL "bad write retry". Which in turn caused the pqSendSome to drop the connection and failed the PQputCopyData. Actually, now that I think of it, this is probably not only related to COPY handling ... The connection drop can be reproduced by introducing some "network problems" to the connection, so that the SSL_write needs to be retried. This can be most easily accomplished by passing a huge buffer to PQputCopyData. For instance on my laptop I can do about 3 PQputCopyData with a 100MB buffer before the connection is dropped. Others results may vary. Also, it doesn't seem to matter what the libpq version is used - I initially started on 8.3, but later confirmed that the same problem exists on 8.4, 9.0 and 9.1. Tested with OpenSSL 0.9.8g and 0.9.8o, no idea about other versions. The following is a log excerpt from my test program (attached) with some extra logging added to libpq: PQputCopyData #0 buf:100000000 total:0 pqFlush() pqFlush() = 0 pqPutMsgEnd() pqSendSome(99999744) SSL_write(0x7e59d60, 0x7ff5a7eac010, 99999744) = 99999744: SSL_error: 0, errno: 0 pqPutMsgEnd() = 0 PQputCopyData #1 buf:100000000 total:100000000 pqPutMsgEnd() pqSendSome(99999744) SSL_write(0x7e59d60, 0x7ff5a7eac010, 99999744) = -1: SSL_error: 3, errno: 11 pqPutMsgEnd() = 0 PQputCopyData #2 buf:100000000 total:200000000 pqFlush() pqSendSome(100000266) SSL_write(0x7e59d60, 0x7ff5a7eac010, 100000266) = -1: SSL_error: 3, errno: 11 pqFlush() = 1 pqPutMsgEnd() pqSendSome(199999488) SSL_write(0x7e59d60, 0x7ff597eab010, 199999488) = -1: SSL_error: 1, errno: 0 SSL error: bad write retry pqPutMsgEnd() = -1 PQputCopyData() = -1: server closed the connection unexpectedly In PQputCopyData #2 it is visible that the first SSL_write called from pqFlush failed with SSL_ERROR_WANT_WRITE. The next SSL_write should have been a retry with the same parameters, but instead was passed a buffer with a different address and length. Hence the "bad write retry". Some googling turned out similar issues for other projects using SSL with non-blocking sockets. The possible workarounds are to disable SSL or to disable non-blocking libpq connections. Both are not always possible - security reasons, 3rd party applications, drivers, etc. So I think this should be fixed in libpq. Not sure exactly how though. It would seem that for the PQputCopyData the best would be to return 0 to indicate that the operation should be retried. No idea for the other possible cases of SSL_write() retry though. What do you think? regards, Martin
/* Debugging libpq SSL connection crashes. */ #include <stdio.h> #include <stdlib.h> #include <stdarg.h> #include "libpq-fe.h" #define BUFSIZE 100000000 #define CONNINFO "host=localhost sslmode=require" static char buf[BUFSIZE]; void die(char *fmt, ...) { va_list ap; va_start(ap, fmt); vfprintf(stderr, fmt, ap); va_end(ap); exit(1); } void exec_sql(PGconn *conn, const char *sql, int expected_status) { PGresult *res; fprintf(stderr, "sql: %s\n", sql); res = PQexec(conn, sql); if (PQresultStatus(res) != expected_status) die("failed: %s\n", sql, PQerrorMessage(conn)); } int main(void) { PGconn *conn; PGresult *res; unsigned long bytes = 0; int count = 0; conn = PQconnectdb(CONNINFO); if (PQstatus(conn) != CONNECTION_OK) die("connection to database failed: %s", PQerrorMessage(conn)); fprintf(stderr, "Connected to: %s\n", CONNINFO); exec_sql(conn, "BEGIN", PGRES_COMMAND_OK); exec_sql(conn, "DROP TABLE IF exists test_ssl_copy", PGRES_COMMAND_OK); exec_sql(conn, "CREATE TABLE test_ssl_copy(t text)", PGRES_COMMAND_OK); exec_sql(conn, "COPY test_ssl_copy(t) FROM stdin", PGRES_COPY_IN); /* Set the connection to non-blocking to enable crashing */ if (PQsetnonblocking(conn, 1) != 0) die("PQsetnonblocking failed: %s", PQerrorMessage(conn)); while (1) { int rc; fprintf(stderr, "PQputCopyData #%d buf:%lu total:%lu\n", count, sizeof(buf), bytes); rc = PQputCopyData(conn, buf, sizeof(buf)); if (rc == 0) { fprintf(stderr, "PQputCopyData needs retry\n"); } if (rc < 0) { die("PQputCopyData() = %d: %s", rc, PQerrorMessage(conn)); } bytes += sizeof(buf); ++count; } if (PQputCopyEnd(conn, NULL) != 1) die("PQputCopyEnd failed: %s", PQerrorMessage(conn)); while ((res = PQgetResult(conn)) != NULL) if (PQresultStatus(res) != PGRES_COMMAND_OK) fprintf(stderr, "COPY completed with error, status=%d: %s", PQresultStatus(res), PQerrorMessage(conn)); exec_sql(conn, "ROLLBACK", PGRES_COMMAND_OK); PQfinish(conn); return 0; }
-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers