[HACKERS] Weird irreproducible behaviors in plpython tests

Tom Lane Sun, 10 Apr 2016 14:12:07 -0700

Buildfarm member tick failed today with what appears to be a bug
introduced by (or at least exposed by) the recent plpython ereport
patch.  Presumably the fact that it's failing and no other critters are
has to do with its use of -DCLOBBER_CACHE_ALWAYS and/or
-DRANDOMIZE_ALLOCATED_MEMORY.  However, I cannot reproduce the failure
by using those switches, even though tick's CentOS platform should be
pretty much identical to my RHEL6 machine.  Can anyone else reproduce it?


After failing at that, it occurred to me to try it under valgrind,
and kaboom!  I found a *different* bug, which has apparently been
there a long time.  (I say different because I don't see how this
one could produce tick's symptoms; it's a reference to already-freed
strings, but not an attempt to pfree one.)  I'll be fixing this one
shortly, but now we have another puzzle: why isn't buildfarm member
skink seeing the same failure?  It is running the plpython tests.
Can anyone else reproduce a failure by valgrind'ing the plpython
tests?  It looks here like

==00:00:00:29.670 24996== Invalid read of size 1
==00:00:00:29.670 24996==    at 0x4A07F92: strlen (mc_replace_strmem.c:403)
==00:00:00:29.670 24996==    by 0x826860: MemoryContextStrdup (mcxt.c:1158)
==00:00:00:29.670 24996==    by 0x800441: set_errdata_field (elog.c:1230)
==00:00:00:29.670 24996==    by 0x803EE8: err_generic_string (elog.c:1210)
==00:00:00:29.670 24996==    by 0xDFEF382: PLy_elog (plpy_elog.c:117)
==00:00:00:29.670 24996==    by 0xDFF049D: PLy_procedure_call (plpy_exec.c:1084)
==00:00:00:29.670 24996==    by 0xDFF1BBF: PLy_exec_function (plpy_exec.c:105)
==00:00:00:29.670 24996==    by 0xDFF25C6: plpython_inline_handler 
(plpy_main.c:345)
==00:00:00:29.670 24996==    by 0x809737: OidFunctionCall1Coll (fmgr.c:1596)
==00:00:00:29.670 24996==    by 0x70E97F: standard_ProcessUtility 
(utility.c:515)
==00:00:00:29.670 24996==    by 0x70A856: PortalRunUtility (pquery.c:1175)
==00:00:00:29.670 24996==    by 0x70B8FF: PortalRunMulti (pquery.c:1306)
==00:00:00:29.670 24996==  Address 0xefe2894 is 36 bytes inside a block of size 
49 free'd
==00:00:00:29.670 24996==    at 0x4A06430: free (vg_replace_malloc.c:446)
==00:00:00:29.670 24996==    by 0x398AE90D5C: ??? (in 
/usr/lib64/libpython2.6.so.1.0)
==00:00:00:29.670 24996==    by 0x398AE79E3A: ??? (in 
/usr/lib64/libpython2.6.so.1.0)
==00:00:00:29.670 24996==    by 0x398AE5C586: ??? (in 
/usr/lib64/libpython2.6.so.1.0)
==00:00:00:29.670 24996==    by 0x398AE5C5BF: ??? (in 
/usr/lib64/libpython2.6.so.1.0)
==00:00:00:29.670 24996==    by 0x398AE9A704: ??? (in 
/usr/lib64/libpython2.6.so.1.0)
==00:00:00:29.670 24996==    by 0xDFEECFE: PLy_traceback (plpy_elog.c:358)
==00:00:00:29.670 24996==    by 0xDFEF21E: PLy_elog (plpy_elog.c:83)
==00:00:00:29.670 24996==    by 0xDFF049D: PLy_procedure_call (plpy_exec.c:1084)
==00:00:00:29.670 24996==    by 0xDFF1BBF: PLy_exec_function (plpy_exec.c:105)
==00:00:00:29.670 24996==    by 0xDFF25C6: plpython_inline_handler 
(plpy_main.c:345)
==00:00:00:29.670 24996==    by 0x809737: OidFunctionCall1Coll (fmgr.c:1596)

the core of the problem being that PLy_get_spi_error_data returns a
pointer to a string that is pointing into the "val" (a/k/a "v") Python
object, which PLy_traceback has carefully released the last refcount on.
So this coding should have failed immediately under any valgrind
testing.  The ereport patch perhaps gave us more scope for the error,
but for me, valgrind'ing the plpython tests fails similarly in 9.5.
So we should have noticed this long since.

I dislike bugs that are platform-dependent with no obvious
reason for it :-(

                        regards, tom lane


-- 
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] Weird irreproducible behaviors in plpython tests

Reply via email to