On Wed, 2021-06-16 at 15:31 +0200, Daniel Gustafsson wrote: > > On 16 Jun 2021, at 01:50, Jacob Champion <pchamp...@vmware.com> wrote: > > I've been tracking down reference leaks in the client. These open > > references prevent NSS from shutting down cleanly, which then makes it > > impossible to open a new context in the future. This probably affects > > other libpq clients more than it affects psql. > > Ah, nice catch, that's indeed a bug in the frontend implementation. The > problem is that the NSS trustdomain cache *must* be empty before shutting down > the context, else this very issue happens. Note this in be_tls_destroy(): > > /* > * It reads a bit odd to clear a session cache when we are destroying the > * context altogether, but if the session cache isn't cleared before > * shutting down the context it will fail with SEC_ERROR_BUSY. > */ > SSL_ClearSessionCache(); > > Calling SSL_ClearSessionCache() in pgtls_close() fixes the error.
That's unfortunate. The session cache is global, right? So I'm guessing we'll need to refcount and lock that call, to avoid cleaning up out from under a thread that's actively using the the cache? > There is another resource leak left (visible in one test after the above is > added), the SECMOD module needs to be unloaded in case it's been loaded. > Implementing that with SECMOD_UnloadUserModule trips a segfault in NSS which I > have yet to figure out (when acquiring a lock with NSSRWLock_LockRead). > > [...] > > With your patches I'm seeing a couple of these: > > SSL error: The one-time function was previously called and failed. Its > error code is no longer available Hmm. Adding SSL_ClearSessionCache() (without thread-safety at the moment) fixes all of the SSL tests for me, and I don't see either the SECMOD leak or the "one-time function" error that you've mentioned. What version of NSS are you running? I'm on 3.63. I've attached my current patchset (based on v37) for comparison. > > I am currently stuck on one last failing test. This leak seems to only > > show up when using TLSv1.2 or below. > > AFAICT the session cache is avoided for TLSv1.3 due to 1.3 not supporting > renegotiation. Nice, at least that mystery is solved. :D Thanks, --Jacob
From 6976d15a4be50276ea2f77fd5c37d238493f9447 Mon Sep 17 00:00:00 2001 From: Jacob Champion <pchamp...@vmware.com> Date: Tue, 15 Jun 2021 10:01:14 -0700 Subject: [PATCH 1/3] nss: don't ignore failures during context shutdown The biggest culprit of a shutdown failure so far seems to be object leaks. A failure here may prevent future contexts from being created (and they'll fail in confusing ways), so don't ignore it. There's not a great way to signal errors from this layer of the stack, but a notice will at least give us a chance of seeing the problem. --- src/interfaces/libpq/fe-secure-nss.c | 11 ++++++++++- 1 file changed, 10 insertions(+), 1 deletion(-) diff --git a/src/interfaces/libpq/fe-secure-nss.c b/src/interfaces/libpq/fe-secure-nss.c index 1b9bd4f1a5..90f57e0d99 100644 --- a/src/interfaces/libpq/fe-secure-nss.c +++ b/src/interfaces/libpq/fe-secure-nss.c @@ -139,7 +139,16 @@ pgtls_close(PGconn *conn) { if (conn->nss_context) { - NSS_ShutdownContext(conn->nss_context); + SECStatus status; + status = NSS_ShutdownContext(conn->nss_context); + + if (status != SECSuccess) + { + pqInternalNotice(&conn->noticeHooks, + "unable to shut down NSS context: %s", + pg_SSLerrmessage(PR_GetError())); + } + conn->nss_context = NULL; } } -- 2.25.1
From acabc4d51d38124db748a6be9dca0ff83693e039 Mon Sep 17 00:00:00 2001 From: Jacob Champion <pchamp...@vmware.com> Date: Tue, 15 Jun 2021 14:37:28 -0700 Subject: [PATCH 2/3] test: check for empty stderr during connect_ok() ...in a similar manner to command_like(), to catch notices-as-errors coming from NSS. --- src/test/authentication/t/001_password.pl | 2 +- src/test/authentication/t/002_saslprep.pl | 2 +- src/test/kerberos/t/001_auth.pl | 2 +- src/test/ldap/t/001_auth.pl | 2 +- src/test/perl/PostgresNode.pm | 5 ++++- src/test/ssl/t/001_ssltests.pl | 4 ++-- src/test/ssl/t/002_scram.pl | 4 ++-- 7 files changed, 12 insertions(+), 9 deletions(-) diff --git a/src/test/authentication/t/001_password.pl b/src/test/authentication/t/001_password.pl index 427a360198..327dfb4ed9 100644 --- a/src/test/authentication/t/001_password.pl +++ b/src/test/authentication/t/001_password.pl @@ -20,7 +20,7 @@ if (!$use_unix_sockets) } else { - plan tests => 23; + plan tests => 32; } diff --git a/src/test/authentication/t/002_saslprep.pl b/src/test/authentication/t/002_saslprep.pl index f080a0ccba..d6508df8bd 100644 --- a/src/test/authentication/t/002_saslprep.pl +++ b/src/test/authentication/t/002_saslprep.pl @@ -17,7 +17,7 @@ if (!$use_unix_sockets) } else { - plan tests => 12; + plan tests => 20; } # Delete pg_hba.conf from the given node, add a new entry to it diff --git a/src/test/kerberos/t/001_auth.pl b/src/test/kerberos/t/001_auth.pl index b5594924ca..62015339a0 100644 --- a/src/test/kerberos/t/001_auth.pl +++ b/src/test/kerberos/t/001_auth.pl @@ -23,7 +23,7 @@ use Time::HiRes qw(usleep); if ($ENV{with_gssapi} eq 'yes') { - plan tests => 44; + plan tests => 54; } else { diff --git a/src/test/ldap/t/001_auth.pl b/src/test/ldap/t/001_auth.pl index 0ae14e4c85..e857c26258 100644 --- a/src/test/ldap/t/001_auth.pl +++ b/src/test/ldap/t/001_auth.pl @@ -9,7 +9,7 @@ use Test::More; if ($ENV{with_ldap} eq 'yes') { - plan tests => 28; + plan tests => 40; } else { diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm index 2027cbf43d..4a1e151d7e 100644 --- a/src/test/perl/PostgresNode.pm +++ b/src/test/perl/PostgresNode.pm @@ -2018,8 +2018,11 @@ sub connect_ok if (defined($params{expected_stdout})) { - like($stdout, $params{expected_stdout}, "$test_name: matches"); + like($stdout, $params{expected_stdout}, "$test_name: stdout matches"); } + + is($stderr, "", "$test_name: no stderr"); + if (@log_like or @log_unlike) { my $log_contents = TestLib::slurp_file($self->logfile, $log_location); diff --git a/src/test/ssl/t/001_ssltests.pl b/src/test/ssl/t/001_ssltests.pl index aec99e7bf6..c80abc9ec9 100644 --- a/src/test/ssl/t/001_ssltests.pl +++ b/src/test/ssl/t/001_ssltests.pl @@ -18,12 +18,12 @@ my $nss; if ($ENV{with_ssl} eq 'openssl') { $openssl = 1; - plan tests => 112; + plan tests => 144; } elsif ($ENV{with_ssl} eq 'nss') { $nss = 1; - plan tests => 112; + plan tests => 138; } else { diff --git a/src/test/ssl/t/002_scram.pl b/src/test/ssl/t/002_scram.pl index adfdc55e1a..7976b2e5b1 100644 --- a/src/test/ssl/t/002_scram.pl +++ b/src/test/ssl/t/002_scram.pl @@ -27,13 +27,13 @@ if ($ENV{with_ssl} eq 'openssl') # Determine whether build supports tls-server-end-point. $supports_tls_server_end_point = check_pg_config("#define HAVE_X509_GET_SIGNATURE_NID 1"); - plan tests => $supports_tls_server_end_point ? 11 : 12; + plan tests => 15; } elsif ($ENV{with_ssl} eq 'nss') { $nss = 1; $supports_tls_server_end_point = 1; - plan tests => 11; + plan tests => 15; } else { -- 2.25.1
From 16cad09e1c534391c6aad5c8f552da6a7e6949fd Mon Sep 17 00:00:00 2001 From: Jacob Champion <pchamp...@vmware.com> Date: Tue, 15 Jun 2021 15:59:39 -0700 Subject: [PATCH 3/3] nss: clean up leaked resources The NSPR file descriptor, a handful of certificate references, and session cache entries were preventing NSS from shutting down. --- src/interfaces/libpq/fe-secure-nss.c | 44 +++++++++++++++++++++++----- 1 file changed, 37 insertions(+), 7 deletions(-) diff --git a/src/interfaces/libpq/fe-secure-nss.c b/src/interfaces/libpq/fe-secure-nss.c index 90f57e0d99..e9dacdc732 100644 --- a/src/interfaces/libpq/fe-secure-nss.c +++ b/src/interfaces/libpq/fe-secure-nss.c @@ -137,9 +137,32 @@ pgtls_init(PGconn *conn, bool do_ssl, bool do_crypto) void pgtls_close(PGconn *conn) { + /* All NSS references must be cleaned up before we close out the context. */ + if (conn->pr_fd) + { + PRStatus status; + + status = PR_Close(conn->pr_fd); + if (status != PR_SUCCESS) + { + pqInternalNotice(&conn->noticeHooks, + "unable to close NSPR fd: %s", + pg_SSLerrmessage(PR_GetError())); + } + + conn->pr_fd = NULL; + } + if (conn->nss_context) { SECStatus status; + + /* + * The session cache must be cleared, or we'll leak references. + * TODO: refcount maybe? This has a global effect. + */ + SSL_ClearSessionCache(); + status = NSS_ShutdownContext(conn->nss_context); if (status != SECSuccess) @@ -573,16 +596,16 @@ pgtls_get_peer_certificate_hash(PGconn *conn, size_t *len) SECOidTag signature_tag; SECOidTag digest_alg; int digest_len; - PLArenaPool *arena; + PLArenaPool *arena = NULL; SECItem digest; - char *ret; + char *ret = NULL; SECStatus status; *len = 0; server_cert = SSL_PeerCertificate(conn->pr_fd); if (!server_cert) - return NULL; + goto cleanup; signature_tag = SECOID_GetAlgorithmTag(&server_cert->signature); if (!pg_find_signature_algorithm(signature_tag, &digest_alg, &digest_len)) @@ -590,7 +613,7 @@ pgtls_get_peer_certificate_hash(PGconn *conn, size_t *len) printfPQExpBuffer(&conn->errorMessage, libpq_gettext("could not find digest for OID '%s'\n"), SECOID_FindOIDTagDescription(signature_tag)); - return NULL; + goto cleanup; } arena = PORT_NewArena(SEC_ASN1_DEFAULT_ARENA_SIZE); @@ -605,14 +628,18 @@ pgtls_get_peer_certificate_hash(PGconn *conn, size_t *len) printfPQExpBuffer(&conn->errorMessage, libpq_gettext("unable to generate peer certificate digest: %s"), pg_SSLerrmessage(PR_GetError())); - PORT_FreeArena(arena, PR_TRUE); - return NULL; + goto cleanup; } ret = pg_malloc(digest.len); memcpy(ret, digest.data, digest.len); *len = digest_len; - PORT_FreeArena(arena, PR_TRUE); + +cleanup: + if (arena) + PORT_FreeArena(arena, PR_TRUE); + if (server_cert) + CERT_DestroyCertificate(server_cert); return ret; } @@ -712,6 +739,8 @@ done: /* san_list will be freed by freeing the arena it was allocated in */ if (arena) PORT_FreeArena(arena, PR_TRUE); + if (server_cert) + CERT_DestroyCertificate(server_cert); PR_Free(server_hostname); if (status == SECSuccess) @@ -826,6 +855,7 @@ pg_cert_auth_handler(void *arg, PRFileDesc *fd, PRBool checksig, PRBool isServer pg_SSLerrmessage(PR_GetError())); } + CERT_DestroyCertificate(server_cert); return status; } -- 2.25.1