Hi,

On 2018-12-16 22:33:00 +1100, Thomas Munro wrote:
> On Fri, Dec 14, 2018 at 4:14 PM Tom Lane <t...@sss.pgh.pa.us> wrote:
> > Andres Freund <and...@anarazel.de> writes:
> > > On December 13, 2018 6:01:04 PM PST, Tom Lane <t...@sss.pgh.pa.us> wrote:
> > >> Has anyone tried to reproduce this on other platforms?
> > >
> > > I recently also hit this locally, but since that's also Debian
> > > unstable... Note that removing openssl "fixed" the issue for me.
> >
> > FWIW, I tried to reproduce this on Fedora 28 and RHEL6, without success.
> > It's possible that there's some significant detail of your configuration
> > that I didn't match, but on the whole "bug in Debian unstable" seems
> > like the most probable theory right now.
>
> I was keen to try to bisect this, but I couldn't reproduce it on a
> freshly upgraded Debian unstable VM, with --with-openssl, using "make
> installcheck" under src/test/authentication. I even tried using the
> gold linker as skink does. Maybe I'm using the wrong checker
> options... Andres, can we see your exact valgrind invocation?

Ok, I think I've narrowed this down a bit further. But far from
completely.

I don't think you need particularly special options, but it's easy to
miss the error, because it doesn't cause postmaster to exit with an
error. It only happens when a bgworker is shut down with SIGQUIT (be it
directly, or via postmaster immediate shutdown):

$ valgrind --quiet --error-exitcode=55 --suppressions=/home/andres/src/postgresql/src/tools/valgrind.supp --suppressions=/home/andres/tmp/valgrind-global.supp --trace-children=yes --track-origins=yes --read-var-info=no --num-callers=20 --leak-check=no --gen-suppressions=all /home/andres/build/postgres/dev-assert/vpath/src/backend/postgres -D /srv/dev/pgdev-dev
2018-12-16 12:53:26.274 PST [1187] LOG:  listening on IPv4 address "127.0.0.1", port 5433

$ kill -QUIT 1187
==1194== Invalid read of size 8
==1194==    at 0x4C3B5A5: check_free (dlerror.c:188)
==1194==    by 0x4C3BAB1: free_key_mem (dlerror.c:221)
==1194==    by 0x4C3BAB1: __dlerror_main_freeres (dlerror.c:239)
==1194==    by 0x53D6F81: __libc_freeres (in /lib/x86_64-linux-gnu/libc-2.28.so)
==1194==    by 0x482D19E: _vgnU_freeres (vg_preloaded.c:77)
==1194==    by 0x567F54: bgworker_quickdie (bgworker.c:662)
==1194==    by 0x48A86AF: ??? (in /lib/x86_64-linux-gnu/libpthread-2.28.so)
==1194==    by 0x5367B76: epoll_wait (epoll_wait.c:30)
==1194==    by 0x5EE7CC: WaitEventSetWaitBlock (latch.c:1078)
==1194==    by 0x5EE6A5: WaitEventSetWait (latch.c:1030)
==1194==    by 0x5EDDBC: WaitLatchOrSocket (latch.c:407)
==1194==    by 0x5EDC23: WaitLatch (latch.c:347)
==1194==    by 0x5992D7: ApplyLauncherMain (launcher.c:1062)
==1194==    by 0x568245: StartBackgroundWorker (bgworker.c:835)
==1194==    by 0x57C295: do_start_bgworker (postmaster.c:5742)
==1194==    by 0x57C631: maybe_start_bgworkers (postmaster.c:5955)
==1194==    by 0x578C3C: reaper (postmaster.c:2940)
==1194==    by 0x48A86AF: ??? (in /lib/x86_64-linux-gnu/libpthread-2.28.so)
==1194==    by 0x535F3B6: select (select.c:41)
==1194==    by 0x576A9F: ServerLoop (postmaster.c:1677)
==1194==    by 0x57642A: PostmasterMain (postmaster.c:1386)
==1194==  Address 0x708d488 is 12 bytes after a block of size 12 alloc'd
==1194==    at 0x483577F: malloc (vg_replace_malloc.c:299)
==1194==    by 0x4AD8D38: CRYPTO_zalloc (mem.c:230)
==1194==    by 0x4AD4F8D: ossl_init_get_thread_local (init.c:66)
==1194==    by 0x4AD4F8D: ossl_init_get_thread_local (init.c:59)
==1194==    by 0x4AD4F8D: ossl_init_thread_start (init.c:426)
==1194==    by 0x4AFE5B9: RAND_DRBG_get0_public (drbg_lib.c:1118)
==1194==    by 0x4AFE5EF: drbg_bytes (drbg_lib.c:963)
==1194==    by 0x7F6DD9: pg_strong_random (pg_strong_random.c:135)
==1194==    by 0x57B70F: RandomCancelKey (postmaster.c:5251)
==1194==    by 0x57C367: assign_backendlist_entry (postmaster.c:5822)
==1194==    by 0x57C0F2: do_start_bgworker (postmaster.c:5692)
==1194==    by 0x57C631: maybe_start_bgworkers (postmaster.c:5955)
==1194==    by 0x578C3C: reaper (postmaster.c:2940)
==1194==    by 0x48A86AF: ??? (in /lib/x86_64-linux-gnu/libpthread-2.28.so)
==1194==    by 0x535F3B6: select (select.c:41)
==1194==    by 0x576A9F: ServerLoop (postmaster.c:1677)
==1194==    by 0x57642A: PostmasterMain (postmaster.c:1386)
==1194==    by 0x4997E0: main (main.c:228)

I now suspect this is a longer-standing issue than I thought. Not all my
valgrind buildfarm branches have ssl enabled (due to an ssl issue a
while back). And previously this wouldn't have been caught, because it
doesn't cause postmaster to fail; it's just that Andrew added a script
that checks logs for valgrind bleats.

The interesting bit is that if I replace the _exit(2) in
bgworker_quickdie() with an exit(2) (i.e. processing atexit handlers),
or manually add an OPENSSL_cleanup() before the _exit(2), valgrind
doesn't find errors.

The fact that one needs an immediate shutdown in a bgworker, with
openssl enabled, explains why this is hard to hit...

Greetings,

Andres Freund
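
For readers less familiar with the exit(2) versus _exit(2) distinction
discussed above, here is a minimal standalone sketch. It is not
PostgreSQL or OpenSSL code: cleanup_handler() and
handler_runs_in_child() are hypothetical names, and the handler only
stands in for whatever cleanup a library registers via atexit(). The
sketch assumes a POSIX system. A forked child registers the handler and
then terminates either way; the parent checks whether the handler ran.

```c
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Write end of the pipe; used by the atexit handler in the child. */
static int report_fd = -1;

/* Hypothetical stand-in for a library's atexit cleanup routine. */
static void cleanup_handler(void)
{
    const char mark = 'c';
    ssize_t rc = write(report_fd, &mark, 1);

    (void) rc;                  /* nothing useful to do on failure */
}

/*
 * Fork a child that registers cleanup_handler and then terminates with
 * exit(2) if use_exit is nonzero, or with _exit(2) otherwise.
 * Returns 1 if the handler ran, 0 if it did not, -1 on setup failure.
 */
int handler_runs_in_child(int use_exit)
{
    int     fds[2];
    pid_t   pid;
    char    buf;
    ssize_t n;

    if (pipe(fds) != 0)
        return -1;

    pid = fork();
    if (pid < 0)
    {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    if (pid == 0)
    {
        /* Child: register the handler, then leave one way or the other. */
        close(fds[0]);
        report_fd = fds[1];
        if (atexit(cleanup_handler) != 0)
            _exit(1);
        if (use_exit)
            exit(2);            /* runs atexit handlers first */
        _exit(2);               /* terminates without running them */
    }

    /* Parent: one byte means the handler ran; EOF means it did not. */
    close(fds[1]);
    n = read(fds[0], &buf, 1);
    close(fds[0]);
    waitpid(pid, NULL, 0);
    return n == 1 ? 1 : 0;
}
```

Under these assumptions, handler_runs_in_child(1) reports that the
handler ran and handler_runs_in_child(0) reports that it did not, which
is the same split as between the exit(2) and _exit(2) variants of
bgworker_quickdie() described above.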