I am interested in reducing the overhead of Go-C-Go round trips for C code that I know (or am at least fairly sure) will behave, for the most part, like non-preemptible loops in Go (as in Go prior to 1.14). Concretely, I have set myself the exercise of making a frankenprogram that does graphics with Vulkan (that part is in C) and talks to other computers over the network (that part is in Go).
Background

A sizable class of Vulkan programs spends the overwhelming majority of its CPU time in a loop recording command buffers. This means calling vkCmd* functions some 10,000 to 100,000 or more times per second. The frequency with which these functions are invoked makes the common Cgo mechanism infeasible, because its constant overhead becomes significantly larger than the time these functions individually run for. In fact, most of the vkCmd* functions these programs care about likely do nothing more than copy their arguments into an array and bump a counter, which makes Cgo's assumption that a C call may block unnecessary here. But that is an implementation detail of a Vulkan driver and differs between drivers (we still assume the driver is well behaved and won't do nasty things like hang forever). Even if we knew the memory layout of a command buffer and replicated the relevant vkCmd* functions in Go from, say, Mesa radv, that would be broken by an inevitable driver update (the .so part of a driver is almost always linked dynamically) and would not work on a different driver such as Mesa anvil (the Intel Vulkan driver). There are also a number of minor nuisances, such as the dynamic cgocheck constantly being angry about pointers to Go memory being passed to C (runtime.KeepAlive calls were carefully placed in the program), and the general discomfort of writing "Go-looking C" in Go.

I suspect there is an alternative way of making the graphics card do work by using indirect draws, similar to the OpenGL AZDO approach (this is speculation, since I'm not at all familiar with that approach). It would let us write indirect draw commands into a large array from Go and call into C much less frequently. It appears to come at a big cost in ergonomics, though: we have no way of interleaving any binding with the draw calls, so we would need a separate logical buffer (an offset passed to vkCmdDraw*Indirect) of indirect draw commands between any two vkCmdBind* calls.

Another approach would be to keep an intermediate command buffer of our own, for example an array of tagged unions (structs with an integer tag and a union) describing which vkCmd* to call and what parameters to give it.
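To make that idea concrete, here is a minimal sketch of such a tagged-union recorder. The names (Recorder, recordedCmd, replayCommands) and the generic argument slots are made up for illustration, and the C side is only a stand-in for real vkCmd* calls on a VkCommandBuffer:

package main

/*
#include <stdint.h>

// Tag values must match the Go constants below.
enum { CMD_BIND_PIPELINE = 1, CMD_DRAW = 2 };

typedef struct {
	int32_t  tag;
	uint64_t a, b, c, d; // generic argument slots
} recordedCmd;

// A real version would call vkCmdBindPipeline, vkCmdDraw, ... on a
// VkCommandBuffer here; this only shows the shape of the replay loop.
static void replayCommands(recordedCmd *cmds, int32_t n) {
	for (int32_t i = 0; i < n; i++) {
		switch (cmds[i].tag) {
		case CMD_BIND_PIPELINE: // vkCmdBindPipeline(cb, ...)
			break;
		case CMD_DRAW: // vkCmdDraw(cb, a, b, c, d)
			break;
		}
	}
}
*/
import "C"

import "unsafe"

const (
	cmdBindPipeline = 1 // keep in sync with the C enum above
	cmdDraw         = 2
)

// Recorder accumulates commands on the Go side so that tens of thousands of
// vkCmd*-style calls per second become one cgo transition per flush.
type Recorder struct {
	cmds []C.recordedCmd
}

func (r *Recorder) BindPipeline(pipeline uint64) {
	r.cmds = append(r.cmds, C.recordedCmd{tag: cmdBindPipeline, a: C.uint64_t(pipeline)})
}

func (r *Recorder) Draw(vertexCount, instanceCount, firstVertex, firstInstance uint32) {
	r.cmds = append(r.cmds, C.recordedCmd{
		tag: cmdDraw,
		a:   C.uint64_t(vertexCount),
		b:   C.uint64_t(instanceCount),
		c:   C.uint64_t(firstVertex),
		d:   C.uint64_t(firstInstance),
	})
}

// Flush replays the whole batch with a single cgo call and resets the buffer.
func (r *Recorder) Flush() {
	if len(r.cmds) == 0 {
		return
	}
	C.replayCommands((*C.recordedCmd)(unsafe.Pointer(&r.cmds[0])), C.int32_t(len(r.cmds)))
	r.cmds = r.cmds[:0]
}

func main() {
	var r Recorder
	r.BindPipeline(42)
	r.Draw(3, 1, 0, 0)
	r.Flush()
}

Since recordedCmd contains no Go pointers, passing &r.cmds[0] to C is fine under the cgo pointer passing rules, and the per-frame cost becomes one Cgo call rather than one per vkCmd*. It is still hoop-jumping, of course: every vkCmd* we care about has to be wrapped by hand and kept in sync with the C side.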
All this hoop-jumping pushes us towards writing a considerable portion of the program in C, which is not necessarily what we want, and that in turn motivates looking for ways to call well-behaved classes of C code with significantly less overhead than Cgo.

rustgo: calling Rust from Go with near-zero overhead

I recalled stumbling upon https://blog.filippo.io/rustgo/ and reread the article. The approach described there is simple but extremely dangerous for opaque C calls (which is what vkCmd* calls are): we have no idea how much stack the C function is going to need, and even with a conservative guess we would still want a stack guard at the bottom to be safe. https://github.com/minio/c2goasm lets us do what the article describes in a less involved manner, but (as far as I understood) it assumes that the C function is a leaf function.

I have never tried installing stack guards on goroutine stacks in Go, but in my own exercise of re-implementing rsc's libtask I tried mprotecting 4k at the bottom of each task's stack to PROT_READ. I noticed that I couldn't have more than about 15k tasks, due to the default limit of about 32k memory maps in Linux. Mprotecting to PROT_READ and restoring the protection at each context switch removed this handicap, but slowed down context switching considerably.

In Go, I imagine, a stack guard for C functions could be done as follows:

Withguard(func() {
	// C functions are called here in the way described in the rustgo article
})

where Withguard would ask morestack for a large stack plus 8k at the bottom, install a PROT_READ page somewhere in that 8k near the bottom of the stack, call the function passed to it, undo the mprotect, and leave. But I speculate this would lead to deadly interactions with the GC, such as occasional segfaults whenever the GC, for whatever reason, touched the PROT_READ part of the stack.

runtime.systemstack

I stumbled upon this function while exploring the Go runtime. It lets me achieve what I would want Withguard for. It has an important caveat: while we are on the system stack, we may not be preempted (and no defers, either). This means we probably should not allocate anything, for fear of dangerous interactions with the GC. We should probably also leave the system stack sooner rather than later, because I suspect the GC will not be able to see the roots of a goroutine that has switched to the system stack, which would lead to issues similar to those with long-running memclrs and memmoves.

For now I have paused my exploration of Go's internals, and I am only slightly familiar with Go 1.14's preemption mechanism, but two questions are still urging an answer:

1) I have heard that gccgo can call C much faster than the standard Go implementation can. If that is true, why?

2) How could a systemstack-like function that cooperates with Go 1.14's preemption mechanism be achieved?