YottaDB is a key-value NoSQL database with a C API. To make it friendlier to Golang users, we have created a Go wrapper (see https://godoc.org/lang.yottadb.com/go/yottadb). A Rust wrapper is in development, with other languages to follow.
With the Golang wrapper, we have run into a variety of panics and crashes that take several forms. In order of frequency, those forms are:

- Panics due to “sweep increased allocation count”.
- Panics due to an invalid address (bad pointer) in the Go heap.
- SIG-11 failures in various places, usually in garbage collection code.
- Panics due to “unknown return address” when trying to return from functions.
- Mutex deadlocks (everyone waiting and nobody holding) and HUGE memory usage.

Researching these symptoms suggests the cause is “usually” either bad cgo code or a “race” condition. The problems get MUCH worse when GOGC is set to 1. Running with either “-msan” or “-race” has not flagged any issues. I am using Golang 1.12.5 on Ubuntu 18.04.

A bit of background: the C API we are wrapping takes parameters that are pointers to structures, which themselves contain pointers to the actual data. Since cgo does not allow Golang structures passed to C to contain pointers to Golang memory, we decided to use Golang structures mostly as anchor points for C-allocated memory that holds the actual structures. And since Golang is not a language where the coder tracks memory and releases it, we set a Finalizer when the C storage is allocated, so that when the anchoring Golang structure is garbage collected, a function runs that releases the C memory attached to it. So far so good.

Here is the basic C structure this code uses (C.ydb_buffer_t):

    typedef struct {
            unsigned int len_alloc;
            unsigned int len_used;
            char *buf_addr;
    } ydb_buffer_t;

Here are the related Golang structures:

    type BufferT struct {
            cbuft    *internalBufferT // Allows BufferT to be copied via assignment
            ownsBuff bool             // If true, we should clean the cbuft when Free'd
    }

    type internalBufferT struct {
            cbuft *C.ydb_buffer_t // C flavor of the ydb_buffer_t struct
    }

    type BufferTArray struct {
            cbuftary *internalBufferTArray
    }

    type internalBufferTArray struct {
            elemUsed  uint32            // Number of elements used
            elemAlloc uint32            // Number of elements in array
            cbuftary  *[]C.ydb_buffer_t // C flavor of ydb_buffer_t array
    }

The internal variant of each of BufferT and BufferTArray exists so that a BufferT/BufferTArray can be copied by an assignment statement without double-free or use-after-free problems. This technique was suggested to us for exactly that issue, as was the use of a Finalizer to have the internal structure release its C storage, which involves releasing not only the C control block but also the buffer it points to (and, in the case of the array, the same for each element of the array).

Our API necessarily does a lot with cgo. Because of the array of C structures (cgo does not support arrays or variadic parameter lists, both of which we require), it has to do its own pointer arithmetic to access a given element anchored by internalBufferTArray. We have a similar issue calling one of the C API functions, ydb_lock_st(), because it takes a variadic parameter list. So we cooked up a C function that takes an array of uintptr values and turns it into a variadic call. Again we load the elements of this parameter array using pointer arithmetic, converting the C pointers to uintptr values, which seems to make the Go structures eligible for collection if they are not subsequently referenced (a uintptr apparently does not count as a reference to the GC).
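To make the pattern concrete, here is a minimal sketch of the allocate-with-Finalizer scheme described above. The helper name allocBuffer is ours for illustration only; it is not the actual YDBGo code, and the real code includes the YottaDB header rather than redeclaring the struct in the cgo preamble:

    package wrapper

    // #include <stdlib.h>
    // typedef struct {
    //         unsigned int len_alloc;
    //         unsigned int len_used;
    //         char *buf_addr;
    // } ydb_buffer_t;
    import "C"

    import (
            "runtime"
            "unsafe"
    )

    // internalBufferT as shown above: a Go anchor for C-allocated memory.
    type internalBufferT struct {
            cbuft *C.ydb_buffer_t // C flavor of the ydb_buffer_t struct
    }

    // allocBuffer allocates the C control block and its data buffer, anchors
    // them in a Go struct, and registers a Finalizer so both pieces of C
    // memory are freed when the Go anchor is garbage collected.
    func allocBuffer(size uint32) *internalBufferT {
            cbuft := (*C.ydb_buffer_t)(C.malloc(C.size_t(unsafe.Sizeof(C.ydb_buffer_t{}))))
            cbuft.buf_addr = (*C.char)(C.malloc(C.size_t(size)))
            cbuft.len_alloc = C.uint(size)
            cbuft.len_used = 0
            ibuft := &internalBufferT{cbuft: cbuft}
            runtime.SetFinalizer(ibuft, func(o *internalBufferT) {
                    C.free(unsafe.Pointer(o.cbuft.buf_addr)) // free the data buffer
                    C.free(unsafe.Pointer(o.cbuft))          // then the control block
            })
            return ibuft
    }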
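Similarly, here is a sketch of the pointer arithmetic used to address element i of a C-allocated ydb_buffer_t array and of how the uintptr parameter array for the variadic shim gets loaded. The helper names are again ours, and for simplicity this uses a raw base pointer rather than our slice-based anchor:

    // elemPtr returns the address of element i of a C array of ydb_buffer_t
    // structures. cgo gives us no array indexing here, so we compute the
    // address ourselves. The round trip through uintptr happens in a single
    // expression, as the unsafe.Pointer rules require.
    func elemPtr(base *C.ydb_buffer_t, i uint32) *C.ydb_buffer_t {
            return (*C.ydb_buffer_t)(unsafe.Pointer(
                    uintptr(unsafe.Pointer(base)) + uintptr(i)*unsafe.Sizeof(C.ydb_buffer_t{})))
    }

    // loadPlist fills the parameter array for the C variadic shim. Each entry
    // is a uintptr, which the GC treats as a plain integer: nothing in this
    // array keeps the Go anchors alive or their Finalizers at bay.
    func loadPlist(parmAry []uintptr, base *C.ydb_buffer_t, n uint32) {
            for i := uint32(0); i < n; i++ {
                    parmAry[i] = uintptr(unsafe.Pointer(elemPtr(base, i)))
            }
    }

The comment on loadPlist is exactly where the trouble described next comes from.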
One issue we found and solved was with this plist usage. The scenario was this:

1. A function call created a key consisting of a variable name and two subscripts, which were passed in to the LockST() wrapper call. Each of these was ultimately housed in a ydb_buffer_t structure, though the subscripts are in the array version.

2. The LockST() code looked at the structures, pulled out the C pointers to the ydb_buffer_t structures, and turned those addresses into uintptr values to store in the parameter array.

3. At this point, it seems, since there were no more references to the passed-in Golang structures, and the C pointers we had pulled out of them had been morphed into uintptr values, the Golang structures became eligible for garbage collection. And because a Finalizer was set for these structures, *the C storage was released before we were finished with it; in fact, we had not even made the call to the C routine yet.*

4. We got errors in the C code and saw that, though the structures were well-formed in the Go routine, what ended up being passed in was complete garbage.

5. We “fixed” this issue by putting a meaningless reference to the Golang variadic list that held all these values just prior to returning from the routine. That kept them from being garbage collected, and the test then ran successfully (see the runtime.KeepAlive sketch below for a more idiomatic form of this fix).

That fixed one of the roughly six different failure types we were having. The test that most readily generates the failures (though others do too) launches 10 goroutines that each perform random operation types (set, get, lock, delete, next, previous, etc.) for a set period of time and then exit. It is only when launching goroutines simultaneously that we see these errors. If we launch the workers as separate processes, we do not see these issues. We also do not see them when the driver program is C instead of Golang.

The main C code base can be found at https://gitlab.com/YottaDB/DB/YDB (master branch), while the Go wrapper can be found at https://gitlab.com/YottaDB/Lang/YDBGo (develop branch); note that LockST() is in simple_api.go, and you can see the added-reference “fix” at the end of it. But quite simply, nearly every call in this API has failed in one way or another with one of the symptoms mentioned above. Is there some better way to do this? Should we be doing something different to escape the Finalizer/GC race that frees storage while we are still using pointers to it? So far we have been working from best guesses, as we do not know Golang's internals. Although I have read some articles on Golang's garbage collection, the best guess I have come up with is that we have a race between our use of the C memory and Golang's eagerness to GC the anchoring structures, which pulls the C memory out from under us while we are using it.

Note this gets slightly murkier once you know that one of the API calls (TpST()) initiates a transaction and then calls back into a given Go routine to actually execute the transaction. Indeed, some of the cores we have seen show us either in or just coming out of a transaction, so either the transaction itself garbage collected things we were still using, or its usage simply reused storage released by a previously executed transaction. <head spins> But even cutting the test down so that only one request type runs still produces both the “bad pointer in Go heap” and the “sweep increased allocation count” errors when run in 10 goroutines.
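For reference, the “meaningless reference” fix in step 5 can be written with runtime.KeepAlive (available since Go 1.7), which pins an object, and therefore delays its Finalizer, until that point in the function. Here is a sketch shaped like our LockST(), building on the structures above; callLockShim is a hypothetical stand-in for our C variadic shim, not the actual YDBGo code:

    // lockSketch shows the shape of LockST() with the fix applied. Once the C
    // pointers have been reduced to uintptr values, nothing references varname
    // or subs, so without the KeepAlive calls the GC may collect them (running
    // their Finalizers and freeing the C storage) before the C call happens.
    func lockSketch(varname *BufferT, subs *BufferTArray, parmAry []uintptr) {
            parmAry[0] = uintptr(unsafe.Pointer(varname.cbuft.cbuft))
            // ... load the remaining plist entries from subs the same way ...
            callLockShim(parmAry) // hypothetical wrapper around the C variadic shim
            // Pin the Go anchors until the C call has returned.
            runtime.KeepAlive(varname)
            runtime.KeepAlive(subs)
    }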
Here are the tops of some actual failures. I've not included the full output, as it references a lot of different threads and gets fairly long, but let me know if you want to see it:

    runtime: nelems=21 nalloc=5 previous allocCount=4 nfreed=65535
    fatal error: sweep increased allocation count

    goroutine 38 [running]:
    runtime.throw(0x5288be, 0x20)
            /snap/go/3739/src/runtime/panic.go:617 +0x72 fp=0xc000157ab8 sp=0xc000157a88 pc=0x4395c2
    runtime.(*mspan).sweep(0x7fae4f5d7258, 0x4c0600, 0x118e800)
            /snap/go/3739/src/runtime/mgcsweep.go:329 +0x8ef fp=0xc000157ba0 sp=0xc000157ab8 pc=0x43015f
    runtime.sweepone(0x0)
            /snap/go/3739/src/runtime/mgcsweep.go:136 +0x282 fp=0xc000157c08 sp=0xc000157ba0 pc=0x42f612
    runtime.deductSweepCredit(0x2000, 0x0)
            /snap/go/3739/src/runtime/mgcsweep.go:438 +0x80 fp=0xc000157c30 sp=0xc000157c08 pc=0x4302c0
    runtime.(*mcentral).cacheSpan(0x7f5460, 0x4143f5)
            /snap/go/3739/src/runtime/mcentral.go:43 +0x61 fp=0xc000157c90 sp=0xc000157c30 pc=0x424fa1
    runtime.(*mcache).refill(0x7fae4f5cc6d0, 0x4)
            /snap/go/3739/src/runtime/mcache.go:135 +0x86 fp=0xc000157cb0 sp=0xc000157c90 pc=0x424cd6
    runtime.(*mcache).nextFree(0x7fae4f5cc6d0, 0x118ea04, 0x448441, 0x4d8420, 0x2582c18)
            /snap/go/3739/src/runtime/malloc.go:786 +0x88 fp=0xc000157ce8 sp=0xc000157cb0 pc=0x41b148
    runtime.mallocgc(0x10, 0x502040, 0x4c6e01, 0x44db90)
            /snap/go/3739/src/runtime/malloc.go:939 +0x7a8 fp=0xc000157d98 sp=0xc000157ce8 pc=0x41ba98
    runtime.growslice(0x502040, 0x0, 0x0, 0x0, 0x1, 0xc000157e98, 0x0, 0x7fae240015a0)
            /snap/go/3739/src/runtime/slice.go:181 +0x228 fp=0xc000157e00 sp=0xc000157d98 pc=0x44ddb8
    main.genName(0x4c76a8, 0x0, 0x448441)
            /home/estess/go/src/rwalk4/randomWalk.go:66 +0x2d3 fp=0xc000157eb8 sp=0xc000157e00 pc=0x4c6e13
    main.runProc(0x0, 0x0, 0xa, 0x3fc999999999999a, 0x0, 0x0)
            /home/estess/go/src/rwalk4/randomWalk.go:82 +0x45 fp=0xc000157f78 sp=0xc000157eb8 pc=0x4c7145
    main.main.func2(0xc0000b8020, 0xc0000b8028, 0x0, 0xc0000b8010)
            /home/estess/go/src/rwalk4/randomWalk.go:141 +0x68 fp=0xc000157fc0 sp=0xc000157f78 pc=0x4c76a8
    runtime.goexit()
            /snap/go/3739/src/runtime/asm_amd64.s:1337 +0x1 fp=0xc000157fc8 sp=0xc000157fc0 pc=0x464671
    created by main.main
            /home/estess/go/src/rwalk4/randomWalk.go:133 +0x178

The other type of failure:

    runtime: pointer 0xc000202648 to unallocated span span.base()=0xc000200000 span.limit=0xc000204000 span.state=0
    runtime: found in object at *(0xc000137840+0x10)
    object=0xc000137840 s.base()=0xc000136000 s.limit=0xc000138000 s.spanclass=10 s.elemsize=64 s.state=mSpanInUse
    *(object+0) = 0x524d82
    *(object+8) = 0xa
    *(object+16) = 0xc000202648 <==
    *(object+24) = 0x4
    *(object+32) = 0xc000202658
    *(object+40) = 0x4
    *(object+48) = 0xc000202668
    *(object+56) = 0x4
    fatal error: found bad pointer in Go heap (incorrect use of unsafe or cgo?)
    runtime stack:
    runtime.throw(0x52b6cc, 0x3e)
            /snap/go/3739/src/runtime/panic.go:617 +0x72 fp=0x7f593affcd00 sp=0x7f593affccd0 pc=0x4395c2
    runtime.findObject(0xc000202648, 0xc000137840, 0x10, 0x0, 0x0, 0x0)
            /snap/go/3739/src/runtime/mbitmap.go:397 +0x3bd fp=0x7f593affcd50 sp=0x7f593affcd00 pc=0x42237d
    runtime.scanobject(0xc000137840, 0xc00003a670)
            /snap/go/3739/src/runtime/mgcmark.go:1174 +0x240 fp=0x7f593affcde0 sp=0x7f593affcd50 pc=0x42dc20
    runtime.gcDrain(0xc00003a670, 0x7)
            /snap/go/3739/src/runtime/mgcmark.go:932 +0x209 fp=0x7f593affce38 sp=0x7f593affcde0 pc=0x42d3f9
    runtime.gcBgMarkWorker.func2()
            /snap/go/3739/src/runtime/mgc.go:1926 +0x16c fp=0x7f593affce78 sp=0x7f593affce38 pc=0x46018c
    runtime.systemstack(0x0)
            /snap/go/3739/src/runtime/asm_amd64.s:351 +0x66 fp=0x7f593affce80 sp=0x7f593affce78 pc=0x462616
    runtime.mstart()
            /snap/go/3739/src/runtime/proc.go:1153 fp=0x7f593affce88 sp=0x7f593affce80 pc=0x43db80

    goroutine 82 [GC worker (idle)]:
    runtime.systemstack_switch()
            /snap/go/3739/src/runtime/asm_amd64.s:311 fp=0xc00004df60 sp=0xc00004df58 pc=0x4625a0
    runtime.gcBgMarkWorker(0xc000039400)
            /snap/go/3739/src/runtime/mgc.go:1890 +0x1be fp=0xc00004dfd8 sp=0xc00004df60 pc=0x429afe
    runtime.goexit()
            /snap/go/3739/src/runtime/asm_amd64.s:1337 +0x1 fp=0xc00004dfe0 sp=0xc00004dfd8 pc=0x464671
    created by runtime.gcBgMarkStartWorkers
            /snap/go/3739/src/runtime/mgc.go:1784 +0x77

Running these tests is a bit complicated. We are working on putting together a Docker container that has the runtime all set up, so that all that is needed is to fire up the test; if desired, we will make this available. Anyway, any insights as to how to make progress with this issue would be most appreciated. I've tried to keep this as brief as possible, and we don't understand what these Golang errors are telling us, so if there is further explanation somewhere that would help, please do point us to it.

Steve