So here's some more diagnostic output from the point of the SEGV:

(gdb) disass
Dump of assembler code for function _$SYSTEM$_Ll1637:
=> 0x0118ace1 <+0>:     cmpl   $0x0,(%edx)
End of assembler dump.
(gdb) i reg
eax            0xb6c77158       -1228443304
ecx            0xb6c76c04       -1228444668
edx            0xfffffff8       -8
ebx            0x12adbf8        19586040
esp            0xb6c75f5c       0xb6c75f5c
ebp            0xb6c75f70       0xb6c75f70
esi            0xb6c77020       -1228443616
edi            0xb6c77020       -1228443616
eip            0x118ace1        0x118ace1 <_$SYSTEM$_Ll1637>
eflags         0x210293         [ CF AF SF IF RF ID ]
cs             0x73             115
ss             0x7b             123
ds             0x7b             123
es             0x7b             123
fs             0x0              0
gs             0x33             51
(gdb) p $eax^
$4 = 0

This tells me that the test at the top of fpc_AnsiStr_Decr_Ref:

        cmpl    $0,(%eax)
        jne     .Ldecr_ref_continue
        ret
.Ldecr_ref_continue:

passed (i.e. (%eax) was NOT nil), but at some point during the execution of the following code:

        // Temps allocated between ebp-24 and ebp+0
        subl    $4,%esp
        // Var S located in register
        // Var l located in register
        movl    %eax,(%esp)
        // [101] l:=@PAnsiRec(S-FirstOff)^.Ref;
        movl    (%eax),%edx
        subl    $8,%edx
        // [102] If l^<0 then exit;
        cmpl    $0,(%edx)

the variable at (%eax) MUST have been changed (to nil) BY ANOTHER THREAD.

Is there any other plausible explanation I may have missed?

If there is no other explanation, then I need to find out how the string variable referred to by (%eax) could have been accessed (or even known to exist) by any other thread in the same address space.

If that variable is local to a function (i.e. foo's Result, with the SEGV occurring upon its assignment immediately after it first comes into scope, per my earlier email) then, absent a bug in FPC's handling of string references and allocation, it seems impossible that it could be known to or referenced by any other thread.

I'm reasonably confident there's no other way it could be overwritten by another thread (i.e. I don't think there are any range or buffer pointer errors anywhere else), so logic tells me I must have the wrong thesis or there's a string handling error in FPC.

Any clues or insight gratefully received :-)

Cheers, Bruce.

PS: I can't use valgrind in practice for a variety of reasons, not the least of which is that I'm unlikely to see the error for an extraordinarily long time: slight changes to the (execution time of the) code made so far have had a dramatic effect on the likelihood of this problem occurring at all. It's clearly some sort of race condition over unprotected memory somewhere.
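
The check-then-reload window described above can be modelled in a few lines. This is only a sketch, written in Python as a model and not taken from the FPC RTL: the names `decr_ref_prologue` and `clear` are hypothetical, the address 0x0812ADC8 is made up, and the `interfere` callback merely stands in for "another thread ran here". It suggests that a write of nil between the two reads of (%eax) yields exactly the %edx value seen in the register dump.

```python
MASK32 = 0xFFFFFFFF  # i386 registers are 32 bits wide


def decr_ref_prologue(cell, interfere=None):
    """Model the first instructions of fpc_AnsiStr_Decr_Ref.

    cell is a one-element list standing in for the string variable that
    %eax points to. interfere is an optional callback standing in for
    another thread running between the nil check and the second read.
    Returns the value %edx would hold at `cmpl $0,(%edx)`, or None if
    the routine returns early because the string is nil.
    """
    if cell[0] == 0:            # cmpl $0,(%eax) / ret when nil
        return None
    if interfere is not None:
        interfere(cell)         # assumed write by another thread
    edx = cell[0]               # movl (%eax),%edx  (second read of S)
    edx = (edx - 8) & MASK32    # subl $8,%edx
    return edx


def clear(cell):
    """Hypothetical interfering write: set the string variable to nil."""
    cell[0] = 0


# Undisturbed: %edx points 8 bytes below the string data (the Ref field).
print(hex(decr_ref_prologue([0x0812ADC8])))         # 0x812adc0
# Disturbed: nil minus 8 wraps around to the value seen in gdb.
print(hex(decr_ref_prologue([0x0812ADC8], clear)))  # 0xfffffff8
```

The model only shows that the observed register state is consistent with such an interleaving; it says nothing about which code performed the write.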

On Thu, May 9, 2013 at 9:47 AM, Bruce Tulloch <pas...@causal.com> wrote:

> I've not managed to trap it again, but based on the information I have
> from the last time it occurred I can say the error happened here:
>
> --- a/rtl/i386/i386.inc
> +++ b/rtl/i386/i386.inc
> @@ -1523,7 +1523,7 @@
>          movl    (%eax),%edx
>          subl    $8,%edx
>          // [102] If l^<0 then exit;
>          cmpl    $0,(%edx)    <-- SEGV OCCURS HERE
>          jl      .Lj3596
>  .Lj3603:
>          // [104] If declocked(l^) then
>
> That is, when testing the string's reference count, the address of the
> reference count field appears to be duff.
>
> I don't know what %edx was pointing to at the time (I hope to know next
> time I trap it) but it was obviously wrong.
>
> -b
>
> On Thu, May 9, 2013 at 9:33 AM, Bruce Tulloch <pas...@causal.com> wrote:
>
>> Thanks Jonas, that confirms what I suspected. Next time I trap an
>> instance of this (rare) fault I will inspect exactly which CPU
>> instruction raised the SEGV inside fpc_AnsiStr_Decr_Ref in search of a
>> source of memory corruption.
>>
>> Bruce.
>>
>> On Wed, May 8, 2013 at 11:49 PM, Jonas Maebe
>> <jonas.ma...@elis.ugent.be> wrote:
>>
>>> On 08 May 2013, at 08:13, Bruce Tulloch wrote:
>>>
>>>> After a random but very long period of time (i.e. very many
>>>> successful calls) I get a SEGV in the built-in function
>>>> fpc_AnsiStr_Decr_Ref.
>>>>
>>>> GDB reports the argument to fpc_AnsiStr_Decr_Ref (the string whose
>>>> reference is to be decremented) is nil (i.e. 0x0).
>>>>
>>>> Prima facie, that's the reason for the SEGV, but how is it possible
>>>> that the compiler would pass a nil pointer to this function in the
>>>> first place?
>>>
>>> The first thing fpc_AnsiStr_Decr_Ref does is check whether its
>>> parameter is nil, and if so it immediately exits. It can be nil in
>>> case the ansistring contains an empty string.
>>>
>>> That routine itself also sets its argument to nil in case this was not
>>> the case initially (it's a var-parameter), and I assume your crash
>>> happens after this has been done.
>>>
>>>> To put this into context, I'm running FPC 2.6.2 on a 32 bit Linux
>>>> system executing in a multi-threaded application (which uses python
>>>> threads and fpc threads). I have not found obvious evidence of memory
>>>> corruption from other execution contexts or shared memory handling
>>>> problems.
>>>
>>> It's nevertheless most likely memory corruption. You can try compiling
>>> with -gv and running your program under valgrind to see whether it
>>> finds anything (you will probably get some false positives about
>>> certain RTL pchar routines such as strscan and strlen, but you can
>>> ignore those).
>>>
>>> Jonas
>>> _______________________________________________
>>> fpc-pascal maillist - fpc-pascal@lists.freepascal.org
>>> http://lists.freepascal.org/mailman/listinfo/fpc-pascal