YottaDB is a key-value NoSQL databaase with a C API. To make it more 
friendly to Golang users, we have created a Go wrapper (see 
https://godoc.org/lang.yottadb.com/go/yottadb). A Rust wrapper is in 
development, with other languages to follow.

With the Golang wrapper, we have run into a variety of panics and crashes 
that take several forms. Those forms in order of frequency are as follows:

   - 
   
   Panics due to “sweep increased allocation count”.
   - 
   
   Panics due to invalid address (bad pointer) in Go heap.
   - 
   
   SIG-11 failures in various places usually dealing with garbage 
   collection.
   - 
   
   Panics due to “unknown return address” when trying to return from 
   functions.
   - 
   
   Mutex deadlocks (everyone waiting and nobody holding) and HUGE memory 
   usage.
   
Researching these issues notes that the issue is “usually” either due to 
bad cgo code or “race” conditions. The problems get MUCH worse when GOGC is 
set to 1. Running with either of the options “-msan” or “-race” has not 
flagged any issues. I am using Golang 1.12.5 on Ubuntu 18.04.

A bit of background:

The C API that we are wrapping has parameters that are pointers to 
structures that themselves contain pointers to information. Consequently, 
we decided to use Golang structures mostly as anchor points to C allocation 
memory holding the actual structures since cgo does not allow Golang 
structures passed to C to have pointers to Golang memory. Since Golang is 
not a language where a coder needs to track memory and release it, we make 
use of Finalizers when the C storage is allocated so when the base Golang 
structure is garbage collected, a function is run that releases the 
allocated C memory attached to it. So far so good.

Here is the basic C structure that this code uses (C.ydb_buffer_t):

typedef struct{ unsigned int    len_alloc;      unsigned int    len_used;       
char            *buf_addr;} ydb_buffer_t;


Here are the related Golang structures:

type BufferT struct {   cbuft    *internalBufferT // Allows BufferT to be 
copied via assignment ownsBuff bool             // If true, we should clean the 
cbuft when Free'd}
type internalBufferT struct {   cbuft *C.ydb_buffer_t     // C flavor of the 
ydb_buffer_t struct}
type BufferTArray struct {      cbuftary *internalBufferTArray}
type internalBufferTArray struct {      elemUsed  uint32            // Number 
of elements used  elemAlloc uint32            // Number of elements in array    
  cbuftary  *[]C.ydb_buffer_t // C flavor of ydb_buffer_t array}


The internal variant of each of BufferT and BufferTArray is so the 
BufferT/BufferTArray can be copied by an assignment statement without 
double free issues or using of storage after free, etc. It is technique we 
found was suggested to deal with this issue as was use of the Finalizer to 
cause the internal structure to release its C storage which involves not 
only releasing the C control block but also the buffer that it points to – 
and in the case of the array, the same for each element in the array.

Our API necessarily does a lot with cgo and due to the array of C 
structures (and cgo not supporting arrays or variadic plists – both of 
which we require), has to do its own pointer arithmetic to access the 
element we want when dealing with individual array elements anchored by 
internalBufferTArray. We have a similar issue with calling one of the C API 
calls (ydb_lock_st()) because it takes a variadic plist and again, cgo 
doesn’t support variadic plists. So we’ve cooked up a C function that takes 
an array of uintptr values and turns it into a variadic plist call. So 
again, we’re loading up elements of this parameter array using pointer 
arithmetic and converting those pointers to uinptr values which seems to 
put them on GC’s radar if the go structures are not subsequently used.

One issue we found and solved was with this latter usage. The scenario was 
this:

   1. 
   
   Function call created a key that contained a variable name and two 
   subscripts that were passed in to the LockST() wrapper call. Each of these 
   was ultimately housed in a ydb_buffer_t structure though the subscripts are 
   in the array version.
   2. 
   
   The LockST() code looked at the structures and pulled out the C pointers 
   to the ydb_buffer_t structures then turning those addresses into uintptr to 
   store in the parameter array.
   3. 
   
   At this point, it seems, since there were no more references to the 
   passed-in Golang structures and the C pointers we had pulled out of them 
   had been morphed to uintptr values, that the Golang structures became 
   eligible for garbage collection. But because there was a Finalizer set for 
   these structures, *the C storage was released when we were not finished 
   with it yet – in fact had not even made the call yet to the C routine.*
   4. 
   
   We got errors in the C code and saw that though the structures were 
   well-formed in the Go routine, the stuff that ended up being passed in was 
   complete garbage.
   5. 
   
   We “fixed” this issue by putting a meaningless reference to the golang 
   variadic list that held all these values just prior to returning from the 
   routine which kept them from being garbage collected such that it ran 
   successfully.
   
That fixed one of around 6 different failure types we were having. The test 
that mostly generates the failures (though others do it too) is where we 
are launching 10 goroutines that each do random operation types (set, get, 
lock, delete, next, previous, etc) for a set period of time, then exit. It 
is only when launching goroutines simultaneously that we see these errors. 
If we launch workers as separate processes, we do not see these issues. We 
also do not see these issues when the driver program is C instead of Golang.

The main C code base can be found at https://gitlab.com/YottaDB/DB/YDB 
(master branch) while the go wrapper can be found at 
https://gitlab.com/YottaDB/Lang/YDBGo (develop branch) – note LockST() is 
in simple_api.go and you can see the added reference “fix” at the end of 
it. But quite simply, nearly every call in this API has failed in one way 
or another with one of the symptoms mentioned above. 

Is there some better way to do this? Should we be doing something different 
to escape the Finalizer/GC race to eliminate storage while we’re still 
using pointers to that storage? So far, we’ve been dealing with 
best-guesses as we do not know Golang’s internals. Although I’ve read some 
articles on Golang’s garbage collection, the best-guess I’ve been able to 
come up with is we have a race between our usage of C and Golang’s 
eagerness to GC the main structures which then pulls the C memory out from 
under us while we are using it.

Note this gets slightly murkier when one understands that one of the API 
calls (TpST()) initiates a transaction and then calls back into a given Go 
routine to actually execute the transaction. And indeed some of the cores 
we’ve seen show us that we are either in or just coming out of a 
transaction so the transaction itself garbage collected things we are still 
using – or perhaps its usage just reused some released storage from a 
previously executed transaction. <head spins> But even cutting down that 
test so that only one request type runs is getting both the “bad pointer in 
Go heap” and the “sweep increased allocation count” errors when run in 10 
goroutines.

Here are the tops of some actual failures. I’ve not included the entire 
failure as it references a lot of different threads so it fairly long but 
let me know if you want to see them:

runtime: nelems=21 nalloc=5 previous allocCount=4 nfreed=65535fatal error: 
sweep increased allocation count
goroutine 38 [running]:runtime.throw(0x5288be, 0x20)    
/snap/go/3739/src/runtime/panic.go:617 +0x72 fp=0xc000157ab8 sp=0xc000157a88 
pc=0x4395c2runtime.(*mspan).sweep(0x7fae4f5d7258, 0x4c0600, 0x118e800)     
/snap/go/3739/src/runtime/mgcsweep.go:329 +0x8ef fp=0xc000157ba0 
sp=0xc000157ab8 pc=0x43015fruntime.sweepone(0x0)       
/snap/go/3739/src/runtime/mgcsweep.go:136 +0x282 fp=0xc000157c08 
sp=0xc000157ba0 pc=0x42f612runtime.deductSweepCredit(0x2000, 0x0)      
/snap/go/3739/src/runtime/mgcsweep.go:438 +0x80 fp=0xc000157c30 sp=0xc000157c08 
pc=0x4302c0runtime.(*mcentral).cacheSpan(0x7f5460, 0x4143f5)    
/snap/go/3739/src/runtime/mcentral.go:43 +0x61 fp=0xc000157c90 sp=0xc000157c30 
pc=0x424fa1runtime.(*mcache).refill(0x7fae4f5cc6d0, 0x4) 
/snap/go/3739/src/runtime/mcache.go:135 +0x86 fp=0xc000157cb0 sp=0xc000157c90 
pc=0x424cd6runtime.(*mcache).nextFree(0x7fae4f5cc6d0, 0x118ea04, 0x448441, 
0x4d8420, 0x2582c18)   /snap/go/3739/src/runtime/malloc.go:786 +0x88 
fp=0xc000157ce8 sp=0xc000157cb0 pc=0x41b148runtime.mallocgc(0x10, 0x502040, 
0x4c6e01, 0x44db90)   /snap/go/3739/src/runtime/malloc.go:939 +0x7a8 
fp=0xc000157d98 sp=0xc000157ce8 pc=0x41ba98runtime.growslice(0x502040, 0x0, 
0x0, 0x0, 0x1, 0xc000157e98, 0x0, 0x7fae240015a0)    
/snap/go/3739/src/runtime/slice.go:181 +0x228 fp=0xc000157e00 sp=0xc000157d98 
pc=0x44ddb8main.genName(0x4c76a8, 0x0, 0x448441)  
/home/estess/go/src/rwalk4/randomWalk.go:66 +0x2d3 fp=0xc000157eb8 
sp=0xc000157e00 pc=0x4c6e13main.runProc(0x0, 0x0, 0xa, 0x3fc999999999999a, 0x0, 
0x0) /home/estess/go/src/rwalk4/randomWalk.go:82 +0x45 fp=0xc000157f78 
sp=0xc000157eb8 pc=0x4c7145main.main.func2(0xc0000b8020, 0xc0000b8028, 0x0, 
0xc0000b8010)     /home/estess/go/src/rwalk4/randomWalk.go:141 +0x68 
fp=0xc000157fc0 sp=0xc000157f78 pc=0x4c76a8runtime.goexit()  
/snap/go/3739/src/runtime/asm_amd64.s:1337 +0x1 fp=0xc000157fc8 sp=0xc000157fc0 
pc=0x464671created by main.main /home/estess/go/src/rwalk4/randomWalk.go:133 
+0x178



The other type of failure:

runtime: pointer 0xc000202648 to unallocated span span.base()=0xc000200000 
span.limit=0xc000204000 span.state=0runtime: found in object at 
*(0xc000137840+0x10)object=0xc000137840 s.base()=0xc000136000 
s.limit=0xc000138000 s.spanclass=10 s.elemsize=64 s.state=mSpanInUse
 *(object+0) = 0x524d82
 *(object+8) = 0xa
 *(object+16) = 0xc000202648 <==
 *(object+24) = 0x4
 *(object+32) = 0xc000202658
 *(object+40) = 0x4
 *(object+48) = 0xc000202668
 *(object+56) = 0x4fatal error: found bad pointer in Go heap (incorrect use of 
unsafe or cgo?)
runtime stack:runtime.throw(0x52b6cc, 0x3e)     
/snap/go/3739/src/runtime/panic.go:617 +0x72 fp=0x7f593affcd00 
sp=0x7f593affccd0 pc=0x4395c2runtime.findObject(0xc000202648, 0xc000137840, 
0x10, 0x0, 0x0, 0x0) /snap/go/3739/src/runtime/mbitmap.go:397 +0x3bd 
fp=0x7f593affcd50 sp=0x7f593affcd00 pc=0x42237druntime.scanobject(0xc000137840, 
0xc00003a670)   /snap/go/3739/src/runtime/mgcmark.go:1174 +0x240 
fp=0x7f593affcde0 sp=0x7f593affcd50 pc=0x42dc20runtime.gcDrain(0xc00003a670, 
0x7)      /snap/go/3739/src/runtime/mgcmark.go:932 +0x209 fp=0x7f593affce38 
sp=0x7f593affcde0 pc=0x42d3f9runtime.gcBgMarkWorker.func2()   
/snap/go/3739/src/runtime/mgc.go:1926 +0x16c fp=0x7f593affce78 
sp=0x7f593affce38 pc=0x46018cruntime.systemstack(0x0)    
/snap/go/3739/src/runtime/asm_amd64.s:351 +0x66 fp=0x7f593affce80 
sp=0x7f593affce78 pc=0x462616runtime.mstart() 
/snap/go/3739/src/runtime/proc.go:1153 fp=0x7f593affce88 sp=0x7f593affce80 
pc=0x43db80
goroutine 82 [GC worker (idle)]:runtime.systemstack_switch()    
/snap/go/3739/src/runtime/asm_amd64.s:311 fp=0xc00004df60 sp=0xc00004df58 
pc=0x4625a0runtime.gcBgMarkWorker(0xc000039400)       
/snap/go/3739/src/runtime/mgc.go:1890 +0x1be fp=0xc00004dfd8 sp=0xc00004df60 
pc=0x429aferuntime.goexit()        /snap/go/3739/src/runtime/asm_amd64.s:1337 
+0x1 fp=0xc00004dfe0 sp=0xc00004dfd8 pc=0x464671created by 
runtime.gcBgMarkStartWorkers      /snap/go/3739/src/runtime/mgc.go:1784 +0x77

Running these tests is a bit complicated. We are working on putting a 
docker container together that has the runtime all setup so all that is 
needed to fire up the test. If desired, this will be made available.

Anyway, any insights as to how to make progress with this issue would be 
most appreciated. I’ve tried to keep this as brief as possible. We don’t 
understand what the Golang errors are telling us. So if further 
explanations somewhere would help, please do let us know.

Steve

-- 
You received this message because you are subscribed to the Google Groups 
"golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to golang-nuts+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/golang-nuts/8d879a7a-8031-4a6a-93d8-5960a4d448c2%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Reply via email to