On 2018-01-04 11:20:33 -0800, Andres Freund wrote: > On 2018-01-04 12:11:37 -0500, Tom Lane wrote: > > Robert Haas <robertmh...@gmail.com> writes: > > > On Thu, Jan 4, 2018 at 11:00 AM, Tom Lane <t...@sss.pgh.pa.us> wrote: > > >> Also, what the devil is happening on skink? > > > > > So, skink is apparently dying during shutdown of a user-connected > > > backend, and specifically the one that executed the 'tablespace' test. > > > > Well, yeah, valgrind is burping: the postmaster log is full of > > > > ==10544== VALGRINDERROR-BEGIN > > ==10544== Syscall param epoll_pwait(sigmask) points to unaddressable byte(s) > > ==10544== at 0x7011490: epoll_pwait (epoll_pwait.c:42) > > ==10544== by 0x4BF40B: WaitEventSetWaitBlock (latch.c:1048) > > ==10544== by 0x4BF40B: WaitEventSetWait (latch.c:1000) > > ==10544== by 0x3C0B3B: secure_read (be-secure.c:166) > > ==10544== by 0x3CCD9E: pq_recvbuf (pqcomm.c:963) > > ==10544== by 0x3CDA07: pq_getbyte (pqcomm.c:1006) > > ==10544== by 0x4E2A2D: SocketBackend (postgres.c:339) > > ==10544== by 0x4E444E: ReadCommand (postgres.c:512) > > ==10544== by 0x4E7588: PostgresMain (postgres.c:4085) > > ==10544== by 0x4641D0: BackendRun (postmaster.c:4412) > > ==10544== by 0x467308: BackendStartup (postmaster.c:4084) > > ==10544== by 0x4675F7: ServerLoop (postmaster.c:1757) > > ==10544== by 0x4689D4: PostmasterMain (postmaster.c:1365) > > ==10544== Address 0x0 is not stack'd, malloc'd or (recently) free'd > > ==10544== > > ==10544== VALGRINDERROR-END > > > > But (a) this is happening in multiple branches, and (b) we've not > > changed anything near that code in awhile. I think something broke > > in valgrind itself. > > Some packages on skink have been upgraded. It appears that there either > was a libc or valgrind change that made valgrind not recognize that a > pointer of 0 might not point anywhere :(
It looks like recent glibc implements epoll_wait() as epoll_pwait() with a NULL sigmask argument, and valgrind isn't happy with it. > Let me check whether valgrind accept multiple suppression files, in > which case I could add a suppression for this error to all > branches. Done that. I've added a valgrind-global.supp that has: { null-sigmask-to-pwait-means-no-sigmask Memcheck:Param epoll_pwait(sigmask) fun:epoll_pwait fun:WaitEventSetWaitBlock } > Will also check whether I can reproduce locally. Locally the above suppression allowed me to continue a bit further. But I promptly hit other new blurbs: assert build: 2018-01-04 12:03:45.927 PST [5716][3/189] STATEMENT: select format('%1$s %4$s', 1, 2, 3); ==5718== Invalid write of size 8 ==5718== at 0x40E3DD: ExecInterpExpr (execExprInterp.c:1117) ==5718== by 0x40F5E9: ExecInterpExprStillValid (execExprInterp.c:1788) ==5718== by 0xE6B0409: ExecEvalExpr (executor.h:282) ==5718== by 0xE6BA6EA: exec_eval_simple_expr (pl_exec.c:5602) ==5718== by 0xE6B9DBB: exec_eval_expr (pl_exec.c:5197) ==5718== by 0xE6B5A4B: exec_stmt_raise (pl_exec.c:3230) ==5718== by 0xE6B2C04: exec_stmt (pl_exec.c:1632) ==5718== by 0xE6B2956: exec_stmts (pl_exec.c:1535) ==5718== by 0xE6B27FC: exec_stmt_block (pl_exec.c:1473) ==5718== by 0xE6B083E: plpgsql_exec_function (pl_exec.c:474) ==5718== by 0xE6AB13C: plpgsql_inline_handler (pl_handler.c:330) ==5718== by 0x76C2B9: OidFunctionCall1Coll (fmgr.c:1327) ==5718== by 0x3A28CA: ExecuteDoStmt (functioncmds.c:2234) ==5718== by 0x5F92EB: standard_ProcessUtility (utility.c:532) ==5718== by 0x5F8E26: ProcessUtility (utility.c:357) ==5718== by 0x5F7D9F: PortalRunUtility (pquery.c:1178) ==5718== by 0x5F7FC0: PortalRunMulti (pquery.c:1324) ==5718== by 0x5F74DE: PortalRun (pquery.c:799) ==5718== by 0x5F10BE: exec_simple_query (postgres.c:1120) ==5718== by 0x5F5640: PostgresMain (postgres.c:4143) ==5718== Address 0x20 is not stack'd, malloc'd or (recently) free'd ==5718== { <insert_a_suppression_name_here> Memcheck:Addr8 fun:ExecInterpExpr fun:ExecInterpExprStillValid fun:ExecEvalExpr fun:exec_eval_simple_expr fun:exec_eval_expr fun:exec_stmt_raise fun:exec_stmt fun:exec_stmts fun:exec_stmt_block fun:plpgsql_exec_function fun:plpgsql_inline_handler fun:OidFunctionCall1Coll fun:ExecuteDoStmt fun:standard_ProcessUtility fun:ProcessUtility fun:PortalRunUtility fun:PortalRunMulti fun:PortalRun fun:exec_simple_query fun:PostgresMain } ==5718== ==5718== Process terminating with default action of signal 11 (SIGSEGV) ==5718== Access not within mapped region at address 0x20 ==5718== at 0x40E3DD: ExecInterpExpr (execExprInterp.c:1117) ==5718== by 0x40F5E9: ExecInterpExprStillValid (execExprInterp.c:1788) ==5718== by 0xE6B0409: ExecEvalExpr (executor.h:282) ==5718== by 0xE6BA6EA: exec_eval_simple_expr (pl_exec.c:5602) ==5718== by 0xE6B9DBB: exec_eval_expr (pl_exec.c:5197) ==5718== by 0xE6B5A4B: exec_stmt_raise (pl_exec.c:3230) ==5718== by 0xE6B2C04: exec_stmt (pl_exec.c:1632) ==5718== by 0xE6B2956: exec_stmts (pl_exec.c:1535) ==5718== by 0xE6B27FC: exec_stmt_block (pl_exec.c:1473) ==5718== by 0xE6B083E: plpgsql_exec_function (pl_exec.c:474) ==5718== by 0xE6AB13C: plpgsql_inline_handler (pl_handler.c:330) ==5718== by 0x76C2B9: OidFunctionCall1Coll (fmgr.c:1327) ==5718== by 0x3A28CA: ExecuteDoStmt (functioncmds.c:2234) ==5718== by 0x5F92EB: standard_ProcessUtility (utility.c:532) ==5718== by 0x5F8E26: ProcessUtility (utility.c:357) ==5718== by 0x5F7D9F: PortalRunUtility (pquery.c:1178) ==5718== by 0x5F7FC0: PortalRunMulti (pquery.c:1324) ==5718== by 0x5F74DE: PortalRun (pquery.c:799) ==5718== by 0x5F10BE: exec_simple_query (postgres.c:1120) ==5718== by 0x5F5640: PostgresMain (postgres.c:4143) ==5718== If you believe this happened as a result of a stack ==5718== overflow in your program's main thread (unlikely but ==5718== possible), you can try to increase the size of the ==5718== main thread stack using the --main-stacksize= flag. ==5718== The main thread stack size used in this run was 8388608. This looks like it might be related to recent plpgsql changes. Looks like some pointer in: /* call input function (similar to InputFunctionCall) */ if (!op->d.iocoerce.finfo_in->fn_strict || str != NULL) is NULL. ==4286== Conditional jump or move depends on uninitialised value(s) ==4286== at 0x6BA8A17: __wcsnlen_avx2 (strlen-avx2.S:261) ==4286== by 0x6AF2FF1: wcsrtombs (wcsrtombs.c:104) ==4286== by 0x6A88A40: wcstombs (wcstombs.c:34) ==4286== by 0x514772: lowerstr_with_len.part.1 (ts_locale.c:284) ==4286== by 0x523A63: readstoplist (ts_utils.c:119) ==4286== by 0xEAB2839: dsnowball_init (dict_snowball.c:201) ==4286== by 0x684975: OidFunctionCall1Coll (fmgr.c:1327) ==4286== by 0x679277: lookup_ts_dictionary_cache (ts_cache.c:337) ==4286== by 0x514E02: LexizeExec (ts_parse.c:203) ==4286== by 0x5153ED: parsetext (ts_parse.c:405) ==4286== by 0x521DD2: to_tsvector_byid (to_tsany.c:249) ==4286== by 0x380261: ExecInterpExpr (execExprInterp.c:664) ==4286== by 0x456069: ExecEvalExprSwitchContext (executor.h:301) ==4286== by 0x456069: evaluate_expr (clauses.c:4873) ==4286== by 0x4588B5: evaluate_function (clauses.c:4424) ==4286== by 0x4588B5: simplify_function (clauses.c:4064) ==4286== by 0x457519: eval_const_expressions_mutator (clauses.c:2676) ==4286== by 0x3D615A: expression_tree_mutator (nodeFuncs.c:2903) ==4286== by 0x458E43: simplify_function (clauses.c:4055) ==4286== by 0x457519: eval_const_expressions_mutator (clauses.c:2676) ==4286== by 0x3D5C02: expression_tree_mutator (nodeFuncs.c:2614) ==4286== by 0x4572C1: eval_const_expressions_mutator (clauses.c:3671) ==4286== Uninitialised value was created by a heap allocation ==4286== at 0x4C2BB76: malloc (vg_replace_malloc.c:299) ==4286== by 0x6A9667: AllocSetAlloc (aset.c:945) ==4286== by 0x6B5AC7: palloc (mcxt.c:848) ==4286== by 0x5238EA: get_tsearch_config_filename (ts_utils.c:55) ==4286== by 0x5239AF: readstoplist (ts_utils.c:75) ==4286== by 0xEAB2839: dsnowball_init (dict_snowball.c:201) ==4286== by 0x684975: OidFunctionCall1Coll (fmgr.c:1327) ==4286== by 0x679277: lookup_ts_dictionary_cache (ts_cache.c:337) ==4286== by 0x514E02: LexizeExec (ts_parse.c:203) ==4286== by 0x5153ED: parsetext (ts_parse.c:405) ==4286== by 0x521DD2: to_tsvector_byid (to_tsany.c:249) ==4286== by 0x380261: ExecInterpExpr (execExprInterp.c:664) ==4286== by 0x456069: ExecEvalExprSwitchContext (executor.h:301) ==4286== by 0x456069: evaluate_expr (clauses.c:4873) ==4286== by 0x4588B5: evaluate_function (clauses.c:4424) ==4286== by 0x4588B5: simplify_function (clauses.c:4064) ==4286== by 0x457519: eval_const_expressions_mutator (clauses.c:2676) ==4286== by 0x3D615A: expression_tree_mutator (nodeFuncs.c:2903) ==4286== by 0x458E43: simplify_function (clauses.c:4055) ==4286== by 0x457519: eval_const_expressions_mutator (clauses.c:2676) ==4286== by 0x3D5C02: expression_tree_mutator (nodeFuncs.c:2614) ==4286== by 0x4572C1: eval_const_expressions_mutator (clauses.c:3671) ==4286== { <insert_a_suppression_name_here> Memcheck:Cond fun:__wcsnlen_avx2 fun:wcsrtombs fun:wcstombs fun:lowerstr_with_len.part.1 fun:readstoplist fun:dsnowball_init fun:OidFunctionCall1Coll fun:lookup_ts_dictionary_cache fun:LexizeExec fun:parsetext fun:to_tsvector_byid fun:ExecInterpExpr fun:ExecEvalExprSwitchContext fun:evaluate_expr fun:evaluate_function fun:simplify_function fun:eval_const_expressions_mutator fun:expression_tree_mutator fun:simplify_function fun:eval_const_expressions_mutator fun:expression_tree_mutator fun:eval_const_expressions_mutator } I don't recall any recent changes in the area, so this might be a glibc internal issue. Greetings, Andres Freund