Hi, On 2015-02-05 23:03:02 +0200, Heikki Linnakangas wrote: > On 01/26/2015 12:14 PM, Andres Freund wrote: > >Hi, > > > >When working on getting rid of ImmediateInterruptOK I wanted to verify > >that ssl still works correctly. Turned out it didn't. But neither did it > >in master. > > > >Turns out there's two major things we do wrong: > > > >1) We ignore the rule that once called and returning > > SSL_ERROR_WANTS_(READ|WRITE) SSL_read()/write() have to be called > > again with the same parameters. Unfortunately that rule doesn't mean > > just that the same parameters have to be passed in, but also that we > > can't just constantly switch between _read()/write(). Especially > > nonblocking backend code (i.e. walsender) and the whole frontend code > > violate this rule. > > I don't buy that. Googling around, and looking at the man pages, I just > don't see anything to support that claim. I believe it is supported to > switch between reads and writes.
There's at least two implementations that seem to have workaround to achieve something like that. Apache's mod_ssl and the ssl layer for libevent. > >2) We start renegotiations in be_tls_write() while in nonblocking mode, > > but don't properly retry to handle socket readyness. There's a loop > > that retries handshakes twenty times (???), but what actually is > > needed is to call SSL_get_error() and ensure that there's actually > > data available. > > Yeah, that's just crazy. > > Actually, I don't think we need to call SSL_do_handshake() at all. At least > on modern OpenSSL versions. OpenSSL internally uses a state machine, and > SSL_read() and SSL_write() calls will complete the handshake just as well. Some blaming seems to show that that's been the case for a long time. > >2) is easy enough to fix, but 1) is pretty hard. Before anybody says > >that 1) isn't an important rule: It reliably causes connection aborts > >within a couple renegotiations. The higher the latency the higher the > >likelihood of aborts. Even locally it doesn't take very long to > >abort. Errors usually are something like "SSL connection has been closed > >unexpectedly" or "SSL Error: sslv3 alert unexpected message" and a host > >of other similar messages. There's a couple reports of those in the > >archives and I've seen many more in client logs. > > Yeah, I can reproduce that. There's clearly something wrong. > I believe this is an OpenSSL bug. I traced the "unexpected message" error to > this code in OpenSSL's s3_read_bytes() function: Yep, I got to that as well. > > case SSL3_RT_APPLICATION_DATA: > > /* > > * At this point, we were expecting handshake data, but have > > * application data. If the library was running inside ssl3_read() > > * (i.e. in_read_app_data is set) and it makes sense to read > > * application data at this point (session renegotiation not yet > > * started), we will indulge it. > > */ > > if (s->s3->in_read_app_data && > > (s->s3->total_renegotiations != 0) && > > (((s->state & SSL_ST_CONNECT) && > > (s->state >= SSL3_ST_CW_CLNT_HELLO_A) && > > (s->state <= SSL3_ST_CR_SRVR_HELLO_A) > > ) || ((s->state & SSL_ST_ACCEPT) && > > (s->state <= SSL3_ST_SW_HELLO_REQ_A) && > > (s->state >= SSL3_ST_SR_CLNT_HELLO_A) > > ) > > )) { > > s->s3->in_read_app_data = 2; > > return (-1); > > } else { > > al = SSL_AD_UNEXPECTED_MESSAGE; > > SSLerr(SSL_F_SSL3_READ_BYTES, SSL_R_UNEXPECTED_RECORD); > > goto f_err; > > } > > So that quite clearly throws an error, *unless* the original application > call was SSL_read(). As you also concluded, OpenSSL doesn't like it when > SSL_write() is called when it's in renegotiation. I think that's OpenSSL's > fault and it should "indulge" the application data message even in > SSL_write(). That'd be nice, but I think there were multiple reports to them about this and there wasn't much of a response. Maybe it's time to try again. > In theory, I guess we should do similar hacks in the server, and always call > SSL_read() before SSL_write(), but it seems to work without it. Or maybe > not; OpenSSL server and client code is not symmetric, so perhaps it works in > server mode without that. I think we should do it on the server side - got some server side errors when I cranked up the pg_receivexlog's status interval and set the wal sender timeout very low. Greetings, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers