I'm not making any function calls in the assembly, just writing to memory addresses that represent the elements/len of the slice. I've also tried using LockOSThread() to see if that made any difference; alas, it does not.
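To be concrete, the pinning attempt looked roughly like this (a sketch only: runJIT is an invented name, and jitcall's Go declaration is inferred from the $0-24 frame in the assembly quoted below):

    import (
        "runtime"
        "unsafe"
    )

    // jitcall is implemented in assembly (quoted below): asm points at a
    // word holding the native code's entry address (hence the extra
    // dereference in the assembly), and stack/locals are pointers to the
    // two slice headers the native code mutates.
    func jitcall(asm unsafe.Pointer, stack, locals *[]uint64)

    // runJIT sketches the pinning attempt: lock the goroutine to its OS
    // thread for the duration of the native call.
    func runJIT(code unsafe.Pointer, stack, locals *[]uint64) {
        runtime.LockOSThread()
        defer runtime.UnlockOSThread()
        jitcall(code, stack, locals)
    }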
On Friday, March 22, 2019 at 4:59:30 AM UTC-7, Robert Engels wrote:

> Are you making any calls modifying the len that would allow GC to occur,
> or change stack size? You might need to pin the goroutine so that the
> operation you are performing is "atomic" with respect to those.
>
> This also sounds very scary if the Go runtime ever had a compacting
> collector.
>
> On Mar 22, 2019, at 12:27 AM, Tom <hype...@gmail.com> wrote:
>
> The allocation is in Go, and assembly never modifies the size of the
> backing array. Assembly only ever modifies len, which is the len of the
> slice and not the backing array.
>
> On Thursday, 21 March 2019 22:18:29 UTC-7, Tamás Gulácsi wrote:
>>
>> On Friday, March 22, 2019 at 6:06:06 AM UTC+1, Tom wrote:
>>>
>>> Still errors I'm afraid :/
>>>
>>> On Thursday, 21 March 2019 21:54:59 UTC-7, Ian Lance Taylor wrote:
>>>>
>>>> On Thu, Mar 21, 2019 at 9:39 PM Tom <hype...@gmail.com> wrote:
>>>> >
>>>> > I've been stuck on this for a few days, so I thought I would ask the
>>>> > brains trust.
>>>> >
>>>> > TL;DR: When I have native amd64 instructions mutating a slice
>>>> > (updating the len + values of a []uint64), I experience spurious and
>>>> > random memory corruption when under heavy load (# runnable goroutines
>>>> > > MAXPROCS, doing the same thing continuously), and only when the GC
>>>> > is enabled. Any debugging ideas or things I should look into?
>>>> >
>>>> > Background:
>>>> >
>>>> > I'm calling into Go assembly with a few pointers to slices
>>>> > (*[]uint64), and that assembly is mutating them (reading/writing
>>>> > values, updating len within capacity). I'm experiencing random memory
>>>> > corruption, but I can only trigger it in the following scenarios:
>>>> >
>>>> > - Heavy load - doing a zillion things at once (specifically running
>>>> >   all my test cases in parallel) and maxing out my machine.
>>>> > - Parallelism - a panic due to memory corruption happens faster if
>>>> >   --parallel is set higher, and never if not in parallel.
>>>> > - GC - the panic never happens if the GC is disabled (of course, the
>>>> >   test process eventually runs out of memory).
>>>> >
>>>> > The memory corruption varies, but usually results in an element of an
>>>> > unrelated slice being zeroed, the len of an unrelated slice being
>>>> > zeroed, or (less likely) a segfault.
>>>> >
>>>> > Tested on go1.11.2 and go1.12.1. I can only trigger this if I run all
>>>> > my test cases at once (with --count at 8000 or so and using
>>>> > t.Parallel()). Running things serially or individually yields the
>>>> > correct behaviour.
>>>> >
>>>> > The assembly in question looks like this:
>>>> >
>>>> > TEXT ·jitcall(SB),NOSPLIT|NOFRAME,$0-24
>>>> >     GO_ARGS
>>>> >     MOVQ asm+0(FP), AX      // Load the address of the assembly section.
>>>> >     MOVQ stack+8(FP), R10   // Load the address of the 1st slice.
>>>> >     MOVQ locals+16(FP), R11 // Load the address of the 2nd slice.
>>>> >     MOVQ 0(AX), AX          // Dereference pointer to native code.
>>>> >     JMP AX                  // Jump to native code.
>>>> >
>>>> > And slice manipulation like this (this is a 'pop'):
>>>> >
>>>> > MOVQ r13, [r10+8]       // Load the length of the slice.
>>>> > DECQ r13                // Decrement the len (I can guarantee this will never underflow).
>>>> > MOVQ r12, [r10]         // Load the 0th element address.
>>>> > LEAQ r12, [r12 + r13*8] // Compute the address of the last element.
>>>> > MOVQ reg, [r12]         // Load the element to reg.
>>>> > MOVQ [r10+8], r13       // Write the len back.
>>>> >
>>>> > or a 'push' like this (note: cap is always large enough for any
>>>> > pushes) ...
>>>> >
>>>> > MOVQ r12, [r10]         // Load the 0th element address.
>>>> > MOVQ r13, [r10+8]       // Load the len.
>>>> > LEAQ r12, [r12 + r13*8] // Compute the address of the last element + 1.
>>>> > INCQ r13                // Increment the len.
>>>> > MOVQ [r10+8], r13       // Save the len.
>>>> > MOVQ [r12], reg         // Write the new element.
>>>> >
>>>> > I acknowledge that calling into code like this is unsupported, but I
>>>> > struggle to understand how such corruption can happen, and having
>>>> > stared at it for a few days, I am frankly stumped. I mean, even if
>>>> > non-cooperative preemption were in these versions of Go, I would
>>>> > expect the GC to abort when it can't find the stack maps for my RIP
>>>> > value. With no GC safe points in my native assembly, I don't see how
>>>> > the GC could interfere (yet the issue disappears with the GC off??).
>>>> >
>>>> > Questions:
>>>> >
>>>> > - Any ideas what I'm doing wrong?
>>>> > - Any ideas how I can trace this from the application side and also
>>>> >   the runtime side? I've tried schedtrace and the like, but the
>>>> >   output didn't appear useful or correlated to the crashes.
>>>> > - Any suggestions for assumptions I might have missed and should
>>>> >   write tests / guards for?
>>
>> Do the allocation in Go, don't modify the slice's backing array's length
>> outside of Go - the runtime won't know about it and will happily
>> allocate over the grown slice.
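For readers following the offsets in the assembly above: a []uint64 reached through r10 (a *[]uint64) is three machine words on amd64, which is what [r10] and [r10+8] are addressing. A sketch of that layout (field names are illustrative; this mirrors reflect.SliceHeader):

    // What the native code sees through r10, assuming the standard
    // amd64 slice-header layout:
    type sliceHeader struct {
        data uintptr // [r10+0]:  address of the 0th element
        len  int     // [r10+8]:  the word the pop/push snippets rewrite
        cap  int     // [r10+16]: left untouched by the native code
    }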
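One way to read that advice: keep the header's len fixed and owned by Go, and have the native code report its depth through a plain integer instead of rewriting the slice header. A sketch under that assumption, with invented names (maxStack, depth):

    const maxStack = 1024 // illustrative capacity bound

    func run() []uint64 {
        // Allocate at full length in Go; the slice header itself is
        // never written outside Go.
        stack := make([]uint64, maxStack)
        var depth uint64 // the native code would bump only this counter

        // ... jitcall would run here, writing elements and updating depth ...

        return stack[:depth] // reslice in Go from the reported depth
    }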