Hi, I tried to run postgres with ubsan to debug something. I ran into two main issues:
1) Despite Tom's recent efforts in [1], I see two ubsan failures in HEAD. These are easy enough to fix as per the attached, although the fix for the GetConfigOptionByNum() isn't great - we should probably pass a nulls array to GetConfigOptionByNum() as well, but that'd have a bunch of followup changes. So I'm inclined to go with the minimal for now. 2) When debugging issues I got very confused by the fact that *sometimes* UBSAN_OPTIONS takes effect and sometimes not. I was trying to have ubsan generate backtraces as well as a coredump with UBSAN_OPTIONS="print_stacktrace=1:disable_coredump=0:abort_on_error=1:verbosity=2" After a lot of debugging I figured out that the options took effect in postmaster, but not in any children. Which in turn is because set_ps_display() breaks /proc/$pid/environ - it's empty in all postmaster children for me. The sanitizer library use /proc/$pid/environ (from [2] to [3]), because they don't want to use getenv() because it wants to work without libc (whether that's the right way, i have my doubts). When just using undefined and alignment sanitizers, the sanitizer library is only initialized when the first error occurs, by which time we've often already called set_ps_display(). Note that ps_status.c breaks /proc/$pid/environ even if the update_process_title GUC is set to false, because init_ps_display() ignores that. Yes, that confused me for a while last night. The reason that /proc/$pid/environ is borken is fairly obvious: We overwrite it in set_ps_display() in the PS_USE_CLOBBER_ARGV case. The kernel just looks at the range set up originally, which we'll overwrite with zeroes. We could try telling the kernel about the new location of environ using prctl() PR_SET_MM_ENV_START/END but the restrictions around that sound problematic. I've included a hacky workaround: Define __ubsan_default_options, a weak symbol libsanitizer uses to get defaults from the application, and return getenv("UBSAN_OPTIONS"). But only if main already was reached, so that we don't end up relying on a not-yet-working getenv(). This also lead me to finally figure out why /proc/$pid/cmdline of postgres children includes a lot of NULL bytes. I'd noticed this because tools like pidstat -l started to print long long status strings at some point, before it got fixed on the pidstat side. The way we overwrite doesn't trigger this special case in the kernel: https://github.com/torvalds/linux/blob/master/fs/proc/base.c#L296 therefore the original size of the commandline arguments are printed, with a lot of boring zeroes. A test run of this in ci, with the guc.c intentionally reintroduced, shows the failure both via core dump https://api.cirrus-ci.com/v1/task/6543164315009024/logs/cores.log and in postmaster's log: https://api.cirrus-ci.com/v1/artifact/task/6543164315009024/log/src/test/regress/log/postmaster.log although that part is a bit annoying to read, because the error is interspersed with other log messages: guc.c:9801:15: runtime error: null pointer passed as argument 2, which is declared to never be null ==19253==Using libbacktrace symbolizer. 2022-03-23 17:20:35.412 UTC [19258][client backend] [pg_regress/alter_operator][14/429:0] ERROR: must be owner of operator === ... 2022-03-23 17:20:35.601 UTC [19254][client backend] [pg_regress/alter_generic][10/1569:0] STATEMENT: ALTER STATISTICS alt_stat2 OWNER TO regress_alter_generic_user2; #0 0x5626562b2ab4 in GetConfigOptionByNum /tmp/cirrus-ci-build/src/backend/utils/misc/guc.c:9801 #1 0x5626562b3fd5 in show_all_settings /tmp/cirrus-ci-build/src/backend/utils/misc/guc.c:10137 2022-03-23 17:20:35.604 UTC [19254][client backend] [pg_regress/alter_generic][10/1576:0] ERROR: must be owner of statistics object alt_stat3 ... 2022-03-23 17:20:35.601 UTC [19254][client backend] [pg_regress/alter_generic][10/1569:0] STATEMENT: ALTER STATISTICS alt_stat2 OWNER TO regress_alter_generic_user2; #0 0x5626562b2ab4 in GetConfigOptionByNum /tmp/cirrus-ci-build/src/backend/utils/misc/guc.c:9801 #1 0x5626562b3fd5 in show_all_settings /tmp/cirrus-ci-build/src/backend/utils/misc/guc.c:10137 2022-03-23 17:20:35.604 UTC [19254][client backend] [pg_regress/alter_generic][10/1576:0] ERROR: must be owner of statistics object alt_stat3 2022-03-23 17:20:35.604 UTC [19254][client backend] [pg_regress/alter_generic][10/1576:0] STATEMENT: ALTER STATISTICS alt_stat3 RENAME TO alt_stat4; #2 0x562655c0ea86 in ExecMakeTableFunctionResult /tmp/cirrus-ci-build/src/backend/executor/execSRF.c:234 #3 0x562655c3f8be in FunctionNext /tmp/cirrus-ci-build/src/backend/executor/nodeFunctionscan.c:95 2022-03-23 17:20:35.605 UTC [19254][client backend] [pg_regress/alter_generic][10/1578:0] ERROR: must be owner of statistics object alt_stat3 2022-03-23 17:20:35.605 UTC [19254][client backend] [pg_regress/alter_generic][10/1578:0] STATEMENT: ALTER STATISTICS alt_stat3 OWNER TO regress_alter_generic_user2; 2022-03-23 17:20:35.606 UTC [19254][client backend] [pg_regress/alter_generic][10/1579:0] ERROR: must be member of role "regress_alter_generic_user3" 2022-03-23 17:20:35.606 UTC [19254][client backend] [pg_regress/alter_generic][10/1579:0] STATEMENT: ALTER STATISTICS alt_stat2 OWNER TO regress_alter_generic_user3; #4 0x562655c10175 in ExecScanFetch /tmp/cirrus-ci-build/src/backend/executor/execScan.c:133 #5 0x562655c10653 in ExecScan /tmp/cirrus-ci-build/src/backend/executor/execScan.c:199 #6 0x562655c3f643 in ExecFunctionScan /tmp/cirrus-ci-build/src/backend/executor/nodeFunctionscan.c:270 2022-03-23 17:20:35.606 UTC [19254][client backend] [pg_regress/alter_generic][10/1580:0] ERROR: must be owner of statistics object alt_stat3 2022-03-23 17:20:35.606 UTC [19254][client backend] [pg_regress/alter_generic][10/1580:0] STATEMENT: ALTER STATISTICS alt_stat3 SET SCHEMA alt_nsp2; 2022-03-23 17:20:35.606 UTC [19254][client backend] [pg_regress/alter_generic][10/1581:0] ERROR: statistics object "alt_stat2" already exists in schema "alt_nsp2" 2022-03-23 17:20:35.606 UTC [19254][client backend] [pg_regress/alter_generic][10/1581:0] STATEMENT: ALTER STATISTICS alt_stat2 SET SCHEMA alt_nsp2; #7 0x562655c09bc9 in ExecProcNodeFirst /tmp/cirrus-ci-build/src/backend/executor/execProcnode.c:463 #8 0x562655bf7580 in ExecProcNode ../../../src/include/executor/executor.h:259 #9 0x562655bf7580 in ExecutePlan /tmp/cirrus-ci-build/src/backend/executor/execMain.c:1633 #10 0x562655bf78b9 in standard_ExecutorRun /tmp/cirrus-ci-build/src/backend/executor/execMain.c:362 Greetings, Andres Freund [1] https://postgr.es/m/calnj-vt9r0dssaow9oxvjfxlenovs_68kj5x0p44atoyh+h...@mail.gmail.com [2] https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=libsanitizer/ubsan/ubsan_flags.cpp;h=9a66bd37518b3a0606049b761ffdd7ddf3c3c714;hb=refs/heads/master#l68 [2] https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=libsanitizer/sanitizer_common/sanitizer_linux.cpp;h=aa59d9718ca89cc554bdf677df3e64ddd233ca59;hb=refs/heads/master#l559
>From 66117f8c593c5041bde9f50f95bfd7a6a0664744 Mon Sep 17 00:00:00 2001 From: Andres Freund <and...@anarazel.de> Date: Wed, 23 Mar 2022 10:01:00 -0700 Subject: [PATCH v3 1/5] configure: check for dlsym not just dlopen. When building with sanitizers the sanitizer library provides dlopen, but not dlsym(), making configure think that -ldl isn't needed. --- configure | 56 ++++++++++++++++++++++++++++++++++++++++++++++++++++ configure.ac | 3 +++ 2 files changed, 59 insertions(+) diff --git a/configure b/configure index f3cb5c2b511..e605f5a17f1 100755 --- a/configure +++ b/configure @@ -11912,6 +11912,62 @@ if test "$ac_res" != no; then : fi +{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for library containing dlsym" >&5 +$as_echo_n "checking for library containing dlsym... " >&6; } +if ${ac_cv_search_dlsym+:} false; then : + $as_echo_n "(cached) " >&6 +else + ac_func_search_save_LIBS=$LIBS +cat confdefs.h - <<_ACEOF >conftest.$ac_ext +/* end confdefs.h. */ + +/* Override any GCC internal prototype to avoid an error. + Use char because int might match the return type of a GCC + builtin and then its argument prototype would still apply. */ +#ifdef __cplusplus +extern "C" +#endif +char dlsym (); +int +main () +{ +return dlsym (); + ; + return 0; +} +_ACEOF +for ac_lib in '' dl; do + if test -z "$ac_lib"; then + ac_res="none required" + else + ac_res=-l$ac_lib + LIBS="-l$ac_lib $ac_func_search_save_LIBS" + fi + if ac_fn_c_try_link "$LINENO"; then : + ac_cv_search_dlsym=$ac_res +fi +rm -f core conftest.err conftest.$ac_objext \ + conftest$ac_exeext + if ${ac_cv_search_dlsym+:} false; then : + break +fi +done +if ${ac_cv_search_dlsym+:} false; then : + +else + ac_cv_search_dlsym=no +fi +rm conftest.$ac_ext +LIBS=$ac_func_search_save_LIBS +fi +{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_search_dlsym" >&5 +$as_echo "$ac_cv_search_dlsym" >&6; } +ac_res=$ac_cv_search_dlsym +if test "$ac_res" != no; then : + test "$ac_res" = "none required" || LIBS="$ac_res $LIBS" + +fi + # Address Sanitizer provides dlopen but not dlsym { $as_echo "$as_me:${as_lineno-$LINENO}: checking for library containing socket" >&5 $as_echo_n "checking for library containing socket... " >&6; } if ${ac_cv_search_socket+:} false; then : diff --git a/configure.ac b/configure.ac index 19d1a803673..b2440e9ae2d 100644 --- a/configure.ac +++ b/configure.ac @@ -1230,6 +1230,9 @@ AC_SUBST(PTHREAD_LIBS) AC_CHECK_LIB(m, main) AC_SEARCH_LIBS(setproctitle, util) AC_SEARCH_LIBS(dlopen, dl) +# Address sanitizer's helper library provides dlopen but not dlsym, thus when +# enabling asan the dlopen check doesn't notice that -ldl is actually required. +AC_SEARCH_LIBS(dlsym, dl) AC_SEARCH_LIBS(socket, [socket ws2_32]) AC_SEARCH_LIBS(shl_load, dld) AC_SEARCH_LIBS(getopt_long, [getopt gnugetopt]) -- 2.35.1.354.g715d08a9e5
>From 97f22d7c7b126464fa2262f93519bcb3bc25c3dc Mon Sep 17 00:00:00 2001 From: Andres Freund <and...@anarazel.de> Date: Wed, 23 Mar 2022 09:57:19 -0700 Subject: [PATCH v3 2/5] Hack up compatibility between ubsan and ps_status.c. --- src/backend/main/main.c | 18 ++++++++++++++++++ 1 file changed, 18 insertions(+) diff --git a/src/backend/main/main.c b/src/backend/main/main.c index c43a527d3f9..7998fdd1f3f 100644 --- a/src/backend/main/main.c +++ b/src/backend/main/main.c @@ -48,6 +48,7 @@ const char *progname; +static bool reached_main = false; static void startup_hacks(const char *progname); @@ -64,6 +65,8 @@ main(int argc, char *argv[]) { bool do_check_root = true; + reached_main = true; + /* * If supported on the current platform, set up a handler to be called if * the backend/postmaster crashes with a fatal signal or exception. @@ -443,3 +446,18 @@ check_root(const char *progname) } #endif /* WIN32 */ } + +const char *__ubsan_default_options(void); +const char * +__ubsan_default_options(void) +{ + /* don't call libc before it's initialized */ + if (!reached_main) + return ""; + + /* + * Use our getenv because libsanitizer gets confused by ps_status.c + * overwriting the environ block. + */ + return getenv("UBSAN_OPTIONS"); +} -- 2.35.1.354.g715d08a9e5
>From 4886f82cc8dcb7de19f191dc58c7beda3417a1f4 Mon Sep 17 00:00:00 2001 From: Andres Freund <and...@anarazel.de> Date: Wed, 23 Mar 2022 09:44:19 -0700 Subject: [PATCH v3 3/5] Fix two ubsan violations. --- src/backend/utils/cache/relcache.c | 2 +- src/backend/utils/misc/guc.c | 5 ++++- 2 files changed, 5 insertions(+), 2 deletions(-) diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c index fbd11883e17..3d05297b0d9 100644 --- a/src/backend/utils/cache/relcache.c +++ b/src/backend/utils/cache/relcache.c @@ -6528,7 +6528,7 @@ write_item(const void *data, Size len, FILE *fp) { if (fwrite(&len, 1, sizeof(len), fp) != sizeof(len)) elog(FATAL, "could not write init file"); - if (fwrite(data, 1, len, fp) != len) + if (len > 0 && fwrite(data, 1, len, fp) != len) elog(FATAL, "could not write init file"); } diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index 932aefc777d..fb88a5e5828 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -9797,7 +9797,10 @@ GetConfigOptionByNum(int varnum, const char **values, bool *noshow) values[4] = _(conf->short_desc); /* extra_desc */ - values[5] = _(conf->long_desc); + if (conf->long_desc) + values[5] = _(conf->long_desc); + else + values[5] = ""; /* context */ values[6] = GucContext_Names[conf->context]; -- 2.35.1.354.g715d08a9e5
>From e5e69c83957ece0086c2b7a958152857b5930050 Mon Sep 17 00:00:00 2001 From: Andres Freund <and...@anarazel.de> Date: Sun, 23 Jan 2022 22:20:10 -0800 Subject: [PATCH v3 4/5] ci: use -fsanitize=undefined,alignment in linux tasks. --- .cirrus.yml | 20 ++++++++++++++++++-- 1 file changed, 18 insertions(+), 2 deletions(-) diff --git a/.cirrus.yml b/.cirrus.yml index e5335fede76..820aafce292 100644 --- a/.cirrus.yml +++ b/.cirrus.yml @@ -157,6 +157,21 @@ task: LINUX_CONFIGURE_FEATURES: *LINUX_CONFIGURE_FEATURES + # Enable a reasonable set of sanitizers. Use the linux task for that, as + # it currently is the fastest task. Also several of the sanitizers work + # best on linux. + # The overhead of alignment sanitizer is low, undefined behaviour has + # moderate overhead. address sanitizer howerever is pretty expensive and + # thus not enabled. + # + # disable_coredump=0, abort_on_error=1: for useful backtraces in case of crashes + # print_stacktraces=1,verbosity=2, duh + # detect_leaks=0: too many uninteresting leak errors in short-lived binaries + UBSAN_OPTIONS: print_stacktrace=1:disable_coredump=0:abort_on_error=1:verbosity=2 + ASAN_OPTIONS: detect_leaks=0:disable_coredump=0:abort_on_error=1 + EXTRA_CFLAGS: -fsanitize=alignment,undefined -fno-sanitize-recover=all + EXTRA_LDFLAGS: -fno-common + only_if: $CIRRUS_CHANGE_MESSAGE !=~ '.*\nci-os-only:.*' || $CIRRUS_CHANGE_MESSAGE =~ '.*\nci-os-only:[^\n]*linux.*' compute_engine_instance: @@ -201,8 +216,9 @@ task: CC="ccache gcc" \ CXX="ccache g++" \ CLANG="ccache clang" \ - CFLAGS="-Og -ggdb" \ - CXXFLAGS="-Og -ggdb" + CFLAGS="-Og -ggdb ${EXTRA_CFLAGS}" \ + CXXFLAGS="-Og -ggdb ${EXTRA_CXXFLAGS}" \ + LDFLAGS="${EXTRA_LDFLAGS}" EOF build_script: su postgres -c "make -s -j${BUILD_JOBS} world-bin" upload_caches: ccache -- 2.35.1.354.g715d08a9e5