Buildfarm member tick failed today with what appears to be a bug introduced by (or at least exposed by) the recent plpython ereport patch. Presumably the fact that it's failing and no other critters are has to do with its use of -DCLOBBER_CACHE_ALWAYS and/or -DRANDOMIZE_ALLOCATED_MEMORY. However, I cannot reproduce the failure by using those switches, even though tick's CentOS platform should be pretty much identical to my RHEL6 machine. Can anyone else reproduce it?
After failing at that, it occurred to me to try it under valgrind, and kaboom! I found a *different* bug, which has apparently been there a long time. (I say different because I don't see how this one could produce tick's symptoms; it's a reference to already-freed strings, but not an attempt to pfree one.) I'll be fixing this one shortly, but now we have another puzzle: why isn't buildfarm member skink seeing the same failure? It is running the plpython tests. Can anyone else reproduce a failure by valgrind'ing the plpython tests? It looks here like ==00:00:00:29.670 24996== Invalid read of size 1 ==00:00:00:29.670 24996== at 0x4A07F92: strlen (mc_replace_strmem.c:403) ==00:00:00:29.670 24996== by 0x826860: MemoryContextStrdup (mcxt.c:1158) ==00:00:00:29.670 24996== by 0x800441: set_errdata_field (elog.c:1230) ==00:00:00:29.670 24996== by 0x803EE8: err_generic_string (elog.c:1210) ==00:00:00:29.670 24996== by 0xDFEF382: PLy_elog (plpy_elog.c:117) ==00:00:00:29.670 24996== by 0xDFF049D: PLy_procedure_call (plpy_exec.c:1084) ==00:00:00:29.670 24996== by 0xDFF1BBF: PLy_exec_function (plpy_exec.c:105) ==00:00:00:29.670 24996== by 0xDFF25C6: plpython_inline_handler (plpy_main.c:345) ==00:00:00:29.670 24996== by 0x809737: OidFunctionCall1Coll (fmgr.c:1596) ==00:00:00:29.670 24996== by 0x70E97F: standard_ProcessUtility (utility.c:515) ==00:00:00:29.670 24996== by 0x70A856: PortalRunUtility (pquery.c:1175) ==00:00:00:29.670 24996== by 0x70B8FF: PortalRunMulti (pquery.c:1306) ==00:00:00:29.670 24996== Address 0xefe2894 is 36 bytes inside a block of size 49 free'd ==00:00:00:29.670 24996== at 0x4A06430: free (vg_replace_malloc.c:446) ==00:00:00:29.670 24996== by 0x398AE90D5C: ??? (in /usr/lib64/libpython2.6.so.1.0) ==00:00:00:29.670 24996== by 0x398AE79E3A: ??? (in /usr/lib64/libpython2.6.so.1.0) ==00:00:00:29.670 24996== by 0x398AE5C586: ??? (in /usr/lib64/libpython2.6.so.1.0) ==00:00:00:29.670 24996== by 0x398AE5C5BF: ??? (in /usr/lib64/libpython2.6.so.1.0) ==00:00:00:29.670 24996== by 0x398AE9A704: ??? (in /usr/lib64/libpython2.6.so.1.0) ==00:00:00:29.670 24996== by 0xDFEECFE: PLy_traceback (plpy_elog.c:358) ==00:00:00:29.670 24996== by 0xDFEF21E: PLy_elog (plpy_elog.c:83) ==00:00:00:29.670 24996== by 0xDFF049D: PLy_procedure_call (plpy_exec.c:1084) ==00:00:00:29.670 24996== by 0xDFF1BBF: PLy_exec_function (plpy_exec.c:105) ==00:00:00:29.670 24996== by 0xDFF25C6: plpython_inline_handler (plpy_main.c:345) ==00:00:00:29.670 24996== by 0x809737: OidFunctionCall1Coll (fmgr.c:1596) the core of the problem being that PLy_get_spi_error_data returns a pointer to a string that is pointing into the "val" (a/k/a "v") Python object, which PLy_traceback has carefully released the last refcount on. So this coding should have failed immediately under any valgrind testing. The ereport patch perhaps gave us more scope for the error, but for me, valgrind'ing the plpython tests fails similarly in 9.5. So we should have noticed this long since. I dislike bugs that are platform-dependent with no obvious reason for it :-( regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers