On Tue, Sep 12, 2017 at 1:18 AM, Fabrizio Fabbri <strabix...@yahoo.com> wrote:
>
>> On Sep 11, 2017, at 7:13 PM, Dima Pasechnik <dimpase+...@gmail.com> wrote:
>>
>>> On Mon, Sep 4, 2017 at 11:15 AM, Daniel Kochmański <dan...@turtleware.eu> wrote:
>>> From the backtrace it is certain that the failure is caused inside the call to
>>> GC_init. Such errors are known to have happened when another GC was
>>> already initialized on the system (I've linked the issue). It might be
>>> caused by something else in bdwgc, I don't know. Either way I'd focus on
>>> the GC_init part.
>>
>> Our project (sagemath) only uses libgc within the embedded ECL. Thus I
>> am really puzzled how another libgc instance might kick in and spoil
>> the game for ECL.
>>
>> One possibility is that clang is using libgc, and thus, in principle,
>> libgc might be sitting somewhere in the runtime?!
>>
>>
>>>
>>> To make sure that I'm right with my assertion, you may put a printf before and
>>> after the call to GC_init. I'm not familiar enough with bdwgc internals to say
>>> what is wrong, though. Maybe updating the bundled sources of GC will help? Or
>>> linking with libgc on the system? It might be that it was a bug in bdwgc
>>> which has already been fixed.
>>
>> We are not using the bdwgc shipped with ECL, we use a separate libgc
>> 7.6.0, which is the latest stable.
>> (Is there a reason to ship bdwgc sources with ECL - do you patch it?)
>>
>
> I'm using ecl with the non-embedded bdwgc as well and I don't have any issues.
>
> Ensure that bdwgc is not also built statically into ecl. I would expect linking
> problems in that case, but it is worth double-checking.
Here is part of a stack trace from the debugger, in a scenario where a call to
the embedded ECL from Python leads to an ECL stack overflow on an already
initialised ECL. It seems to be related to the particular thread this call
comes from (another, usual, calling sequence does not lead to crashes).
There is no mention of GC in the stack trace.
This looks to me like a lack of thread safety on the ECL side, although I might
be wrong.

...
frame #16: 0x000000088444b9d6 libecl.so.16.1`si_serror(narg=6, cformat=0x0000000000d27ba0, eformat=0x00000008847d12a0) at error.d:549
frame #17: 0x000000088448bd42 libecl.so.16.1`ecl_cs_overflow at stacks.d:76
frame #18: 0x00000008844168af libecl.so.16.1`ecl_interpret(frame=0x00007fffdeff2658, env=0x0000000000000001, bytecodes=0x0000000000db33c0) at interpreter.d:286
frame #19: 0x0000000884414afc libecl.so.16.1`ecl_apply_from_stack_frame(frame=0x00007fffdeff2658, x=0x0000000000db33c0) at eval.d:79
frame #20: 0x000000088441545b libecl.so.16.1`cl_apply(narg=0, fun=0x0000000000db33c0, lastarg=0x0000000000000001) at eval.d:164
frame #21: 0x0000000883e0e1b4 ecl.so`__pyx_f_4sage_4libs_3ecl_ecl_safe_funcall(__pyx_v_func=0x0000000000769600, __pyx_v_arg=0x0000000000e6dfa0) at ecl.c:5831
frame #22: 0x0000000883e0d519 ecl.so`__pyx_f_4sage_4libs_3ecl_ecl_safe_read_string(__pyx_v_s="(setf *load-verbose* NIL)") at ecl.c:6084
frame #23: 0x0000000883e0d02b ecl.so`__pyx_f_4sage_4libs_3ecl_ecl_eval(__pyx_v_s=0x0000000882add970, __pyx_skip_dispatch=0) at ecl.c:10682
frame #24: 0x0000000883e0cd4c ecl.so`__pyx_pf_4sage_4libs_3ecl_10ecl_eval(__pyx_self=0x0000000000000000, __pyx_v_s=0x0000000882add970) at ecl.c:10762
frame #25: 0x0000000883e0cab7 ecl.so`__pyx_pw_4sage_4libs_3ecl_11ecl_eval(__pyx_self=0x0000000000000000, __pyx_v_s=0x0000000882add970) at ecl.c:10745
frame #26: 0x0000000800d8a68f libpython2.7.so.1`call_function(pp_stack=0x00007fffdeff2c00, oparg=1) at ceval.c:4340
frame #27: 0x0000000800d854d2 libpython2.7.so.1`PyEval_EvalFrameEx(f=0x00000008829939b0, throwflag=0) at ceval.c:2989
...
frame #91: 0x0000000800d88361 libpython2.7.so.1`PyEval_CallObjectWithKeywords(func=0x000000087cdf99e0, arg=0x000000080064e060, kw=0x0000000000000000) at ceval.c:4221
frame #92: 0x0000000800de60d1 libpython2.7.so.1`t_bootstrap(boot_raw=0x0000000807015598) at threadmodule.c:620
frame #93: 0x00000008012d3b55 libthr.so.3`___lldb_unnamed_symbol1$$libthr.so.3 + 325

>
>> Thanks,
>> Dima
>>
>>>
>>> Regards,
>>>
>>> Daniel
>>>
>>>
>>>
>>>> On 04.09.2017 12:04, Dima Pasechnik wrote:
>>>>
>>>> On Fri, Sep 1, 2017 at 1:57 PM, Daniel Kochmański <dan...@turtleware.eu>
>>>> wrote:
>>>>>
>>>>> I don't think it's related to shared vs static - rather two GCs running
>>>>> concurrently. Try commenting out the GC_init call in ecl and see what
>>>>> happens.
>>>>
>>>> I don't understand how two GCs can run concurrently on a memory region
>>>> controlled by ECL which is statically linked to GC...
>>>> In fact I am pretty sure no other instances of GC are running anywhere
>>>> within our process tree.
>>>>
>>>> By the way, I don't know whether it's obvious from the backtrace that
>>>> cl_boot() has been completed, or not.
>>>>
>>>> If it actually was completed, could it be a bug that invalidates the
>>>> bit indicating that cl_boot() has been done?
>>>>
>>>> We have seen similar troubles with clang recently, related to FPE.
>>>> There an FPE bit was flipped by assignment of a double to an
>>>> integer type (sic!).
>>>> It took us a lot of head banging on various hard surfaces to debug this:
>>>> https://trac.sagemath.org/ticket/22799
>>>> it turned out we did hit a known bug:
>>>> https://bugs.llvm.org//show_bug.cgi?id=17686
>>>>
>>>>> Do you need sigchld for anything? Run-program was rewritten and sigchld
>>>>> handling wasn't a viable option anymore for it.
>>>>>
>>>> We do set ECL_OPT_TRAP_SIGCHLD to 0, thus I presume we
>>>> now can simply skip it altogether.
>>>>
>>>> Thanks,
>>>> Dima
>>>>
>>>>> I'm on my phone, will be available after the weekend.
>>>>>
>>>>> Regards, D.
>>>>>
>>>>>
>>>>> On 1 September 2017 14:47:57 CEST, Dima Pasechnik
>>>>> <dimpase+...@gmail.com>
>>>>> wrote:
>>>>>>
>>>>>> Hi Daniel,
>>>>>> Thanks for the message. The scenario you talk about only happens if GC
>>>>>> is a shared library, right?
>>>>>>
>>>>>> I've rebuilt GC disabling shared libs, and ECL doing static linking to
>>>>>> GC.
>>>>>> And I still get very similar segfaults:
>>>>>>
>>>>>> ;;; ECL C Backtrace
>>>>>> ;;; 0 ecl_internal_error (0x87d79b375)
>>>>>> ;;; 1 init_unixint (0x87d7c17e0)
>>>>>> ;;; 2 init_unixint (0x87d7c1582)
>>>>>> ;;; 3 pthread_sigmask (0x80103779d)
>>>>>> ;;; 4 pthread_getspecific (0x801036d6f)
>>>>>> ;;; 5 unknown (0x7ffffffff193)
>>>>>> ;;; 6 GC_push_current_stack (0x87d7ef7c3)
>>>>>> ;;; 7 GC_with_callee_saves_pushed (0x87d7f7360)
>>>>>> ;;; 8 GC_push_roots (0x87d7ef9c2)
>>>>>> ;;; 9 GC_mark_some (0x87d7ec97c)
>>>>>> ;;; 10 GC_stopped_mark (0x87d7e6b7a)
>>>>>> ;;; 11 GC_try_to_collect_inner (0x87d7e6a75)
>>>>>> ;;; 12 GC_init (0x87d7f08ea)
>>>>>> ;;; 13 init_alloc (0x87d7d5669)
>>>>>> ;;; 14 cl_boot (0x87d69f66b)
>>>>>> ...
>>>>>>
>>>>>> And a very similar picture on the develop branch of ECL - although
>>>>>> I had to change our code, as in particular
>>>>>> ECL_OPT_TRAP_SIGCHLD is gone...
>>>>>>
>>>>>> So, what can it be? Some signals issue?
>>>>>>
>>>>>> Thanks,
>>>>>> Dima
>>>>>>
>>>>>> On Fri, Sep 1, 2017 at 7:38 AM, Daniel Kochmański <dan...@turtleware.eu>
>>>>>> wrote:
>>>>>>>
>>>>>>> Hey Dima,
>>>>>>>
>>>>>>> this looks like the issue with having GC initialized before ECL kicks
>>>>>>> in.
>>>>>>> See https://gitlab.com/embeddable-common-lisp/ecl/issues/371 for a
>>>>>>> discussion about this problem. Basically some other component already
>>>>>>> called GC_init and ECL calls it once more. It's arguably not a bug.
>>>>>>>
>>>>>>> Best regards,
>>>>>>>
>>>>>>> Daniel
>>>>>>>
>>>>>>>
>>>>>>>> On 31.08.2017 15:29, Dima Pasechnik wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> Dear all,
>>>>>>>>
>>>>>>>> I'm struggling to understand strange segfaults coming from
>>>>>>>> ECL(+Maxima) on FreeBSD embedded into Python; they typically look as
>>>>>>>> follows:
>>>>>>>>
>>>>>>>> Got signal before environment was installed on our thread
>>>>>>>> [2: No such file or directory]
>>>>>>>>
>>>>>>>> ;;; ECL C Backtrace
>>>>>>>> ;;; 0 ecl_internal_error (0x87d790765)
>>>>>>>> ;;; 1 init_unixint (0x87d7b6bd0)
>>>>>>>> ;;; 2 init_unixint (0x87d7b6972)
>>>>>>>> ;;; 3 pthread_sigmask (0x80103779d)
>>>>>>>> ;;; 4 pthread_getspecific (0x801036d6f)
>>>>>>>> ;;; 5 unknown (0x7ffffffff193)
>>>>>>>> ;;; 6 GC_push_all_stacks (0x87db1ea2c)
>>>>>>>> ;;; 7 GC_mark_some (0x87db12eec)
>>>>>>>> ;;; 8 GC_stopped_mark (0x87db09baa)
>>>>>>>> ;;; 9 GC_try_to_collect_inner (0x87db09a75)
>>>>>>>> ;;; 10 GC_init (0x87db16f4f)
>>>>>>>> ;;; 11 init_alloc (0x87d7caa59)
>>>>>>>> ;;; 12 cl_boot (0x87d694a5b)
>>>>>>>> ;;; 13 initecl (0x87d218340)
>>>>>>>> ;;; 14 initecl (0x87d20a43f)
>>>>>>>> ;;; 15 initecl (0x87d207e28)
>>>>>>>> ;;; 16 _PyImport_LoadDynamicModule (0x800b3ed1c)
>>>>>>>> ;;; 17 PyImport_AppendInittab (0x800b3d71f)
>>>>>>>> ;;; 18 PyImport_AppendInittab (0x800b3d1a8)
>>>>>>>> ;;; 19 PyImport_ImportModuleLevel (0x800b3c2ce)
>>>>>>>> ;;; 20 _PyBuiltin_Init (0x800b162d7)
>>>>>>>> ;;; 21 PyObject_Call (0x800a7d3e3)
>>>>>>>> ;;; 22 PyEval_EvalFrameEx (0x800b2121c)
>>>>>>>> ;;; 23 PyEval_EvalCodeEx (0x800b1b5d4)
>>>>>>>> ;;; 24 PyEval_EvalCode (0x800b1ad96)
>>>>>>>> ;;; 25 PyImport_ExecCodeModuleEx (0x800b3ad11)
>>>>>>>> ;;; 26 PyImport_AppendInittab (0x800b3ddb8)
>>>>>>>> ;;; 27 PyImport_AppendInittab (0x800b3d71f)
>>>>>>>> ;;; 28 PyImport_AppendInittab (0x800b3d1a8)
>>>>>>>> ;;; 29 PyImport_ImportModuleLevel (0x800b3c2ce)
>>>>>>>> ;;; 30 _PyBuiltin_Init (0x800b162d7)
>>>>>>>> ;;; 31 PyEval_EvalFrameEx (0x800b22dd1)
>>>>>>>> Segmentation fault (core dumped)
>>>>>>>>
>>>>>>>> It looks as if ECL (version 16.1.2) is being called before an
>>>>>>>> initialisation is complete, but is it possible to say more without a
>>>>>>>> debugger?
>>>>>>>>
>>>>>>>> More details: this is on FreeBSD 11.0, clang 3.8.0, GC version 7.6.0
>>>>>>>> with libatomic_ops version 7.4.6.
>>>>>>>> And only reproducible on FreeBSD.
>>>>>>>>
>>>>>>>> ECL is built with --disable-threads; GC is built with or without
>>>>>>>> threads; the result is still the same.
>>>>>>>> (so it's unclear to me where pthread_* calls in the trace
>>>>>>>> come from).
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Dima
>>>>>>>>
>>>>>>>> PS. the segfault is at the bottom of
>>>>>>>> https://trac.sagemath.org/ticket/22679#comment:87
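
Editorial sketch, not part of the original messages. The latest reply in this thread suspects that the ECL stack overflow depends on which thread the call into the embedded ECL comes from. If that hypothesis holds, one possible workaround is to confine every call into the embedded library to a single owner thread. The Python 3 sketch below shows only that confinement pattern, using the standard library's `concurrent.futures.ThreadPoolExecutor` with one worker. The names `fake_ecl_eval`, `eval_on_owner_thread` and `thread_ids_seen` are hypothetical stand-ins and are not part of Sage or ECL. A real setup would presumably also have to run the ECL initialisation on that same owner thread, which is an assumption not confirmed anywhere in the thread.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# A pool with exactly one worker: every submitted call runs on that worker.
_owner = ThreadPoolExecutor(max_workers=1)


def fake_ecl_eval(s):
    """Hypothetical stand-in for the real binding.

    It ignores the Lisp string and returns the id of the thread it ran on,
    so the confinement can be observed.
    """
    return threading.get_ident()


def eval_on_owner_thread(s):
    """Forward the call to the owner thread and block until it finishes."""
    return _owner.submit(fake_ecl_eval, s).result()


def thread_ids_seen(n_callers=4):
    """Call from several threads and collect the ids of the executing thread."""
    seen = []

    def caller():
        seen.append(eval_on_owner_thread("(setf *load-verbose* NIL)"))

    threads = [threading.Thread(target=caller) for _ in range(n_callers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen


if __name__ == "__main__":
    ids = thread_ids_seen()
    # All calls should report the same (owner) thread id.
    print(len(set(ids)) == 1)
```

The trade-off is that all calls into the library become serialised, which is acceptable only if the library is not expected to run concurrently anyway (here ECL is built with --disable-threads).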