Hi,
We are experiencing a problem that I believe may be related to issue 14812 <https://github.com/golang/go/issues/14812>, but I wanted to ask here before adding to that case or filing a new issue. Of course, we'd also greatly appreciate any advice about how to make our program performant.

Here is what we observe: at Laserlike one of our core user-facing services (the "leaf") typically responds to a particular RPC in <400ms. During GC we see latency spikes to >5s on some simple requests. The stop-the-world periods are short, so the GC spikes appear to happen at other times.

We have been unable to replicate this in a simple program, but we did run our code in a test mode that reproduces it. In our test environment the server loads ~10 GB of persistent data (never deleted, so it doesn't really need to be GCed), and we ask for 8 processors. We are running Go 1.8.3 on Kubernetes on a GCP machine of type n1-highmem-64. To create the problem we send the server a single request with >500 tasks.

This Google Drive folder has leaf-logs-full.redacted.txt as well as other performance tooling files <https://drive.google.com/drive/folders/0BypqYg6r4ebqb0lQNnM5eWtISms?usp=sharing>. A snippet from that log shows normal query responses and timing:

I0719 22:50:22.467367 leaf.go:363] Worker #5 done search for '[redacted]', took 0.013 seconds
I0719 22:50:22.467406 leaf.go:225] Worker #5 starting search for '[redacted]'
I0719 22:50:22.467486 leaf.go:363] Worker #6 done search for '[redacted]', took 0.001 seconds
I0719 22:50:22.467520 leaf.go:225] Worker #6 starting search for '[redacted]'
I0719 22:50:22.468050 leaf.go:363] Worker #9 done search for '[redacted]', took 0.005 seconds

We have observed that if a GC cycle happens to start while serving traffic (which is often) AND there is a large amount of time spent in assist, then our serving latency skyrockets by 10x. In the log, the slowdown commences roughly when pacer assist starts at I0719 22:50:31.079283 and reverts to normal latencies shortly after the GC cycle completes at I0719 22:50:36.806085. At the bottom of this message I copy parts of the log where we see latencies of up to 729ms on tasks; note in particular the gc line showing 32929ms of CPU spent in alloc GC assist. We have captured an attached CPU profile during this time which seems to confirm a large amount of time spent in runtime.gcAssistAlloc.func1.

Pardon our ignorance about GC in Go, but our hypothesis about what may be going wrong is that our large in-memory data structures cause the GC to go into assist mode often, and that for reasons we don't understand malloc becomes expensive in that mode. Since we also create ~100k new data objects when processing user requests, we are guessing those allocations become very slow. Another piece of evidence for this hypothesis is that we have another (prototype) implementation of this kind of service that makes more use of object pools and doesn't seem to have as much of a slowdown during GC.

Note on the large in-memory data structures. The principal data structures can be thought of as:

map[uint64][]byte -- about 10M map entries, slice lengths between 5K and 50K, around 10 GB total memory usage
map[uint64][]uint64 -- about 50M map entries, slice lengths between 10 and 100K in a zipfian distribution, about 3 GB total memory usage

These data structures mostly stay as-is over the life of the program.
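For concreteness, here is a minimal sketch of roughly those shapes (the names and toy sizes below are invented for illustration; this is not our actual code). As we understand it, the byte and uint64 contents themselves are pointer-free, but every map bucket and every per-entry slice header still holds a pointer the collector has to trace, so with ~60M entries there is a lot of scan work even though nothing in these structures changes after load:

package main

import "fmt"

// docBytes stands in for the ~10M-entry map[uint64][]byte (~10 GB total);
// each value is a slice header whose data pointer the GC has to follow.
var docBytes = map[uint64][]byte{}

// postings stands in for the ~50M-entry map[uint64][]uint64 (~3 GB total)
// with zipfian slice lengths.
var postings = map[uint64][]uint64{}

func main() {
	// Toy population; in the real server roughly 13 GB is loaded at startup.
	docBytes[1] = make([]byte, 5000)
	postings[1] = make([]uint64, 100)
	fmt.Println(len(docBytes), len(postings))
}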
We are trying to figure out how to solve this, so we would appreciate any advice. An engineer on our team wrote up the following ideas, none of which are that great:

1. Figure out a simple way to prevent our goroutines slowing down during GC. I had some hopes LockOSThread() could be made to work, but it didn't seem to help in my experiments. I'm not ruling this solution out entirely, but if it's the write barriers that are the main problem, I don't have much hope.

2. Run at least 2 replicas of all our servers. Manage their GC cycles ourselves, synchronized so that at most one replica is in GC at any given time. The clients should either send all requests to both replicas (and cancel when one replies), or use some more complicated Kubernetes and client logic so a GCing replica is never sent requests. This is the simplest solution that is likely to work, and doesn't require us to change our habits too much. It just costs more. :) (A sketch of the GC-pacing piece follows this list.)

3. Refactor our current servers, and write future servers, with the explicit goal of reducing GC burden. Give our servers more memory and increase SetGCPercent() so that garbage collection happens less frequently. Use simpler data structures with fewer pointers to reduce how long the GC cycle lasts when it does run. This isn't a perfect solution, because the GC will still hit us on occasion, but it reduces the harm.

4. Throw out much of the convenience of Go and write our own allocators/shared pointers that index into a flat byte buffer, making no use of the GC. (Either do this for particular objects, or write a general solution.) Make GC cycles so infrequent that we can either turn GC off completely without going OOM, or at least run it only once per day, during off hours. This seems very hard, and we'll be fighting with Go a lot. (A sketch of the flat-buffer idea also follows this list.)

5. Implement the core serving data structures of our service in C++, paired off with another server written in Go that manages populating that server's data -- the Go server would manage the data in the C++ server via RPC, and query it over RPC.

6. The nuclear option: decide that Go is simply not a suitable language for our serving stack. Rewrite major servers in C++.
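To make idea 2 concrete, here is a minimal sketch of the "manage the GC cycles ourselves" piece, assuming the replicas coordinate out of band (the coordination and serving code are not shown, and the interval is a placeholder, not a worked-out design):

package main

import (
	"runtime"
	"runtime/debug"
	"time"
)

func main() {
	// Disable the automatic pacer; collections now happen only when forced.
	// This risks OOM if the heap grows faster than expected between cycles.
	debug.SetGCPercent(-1)

	go func() {
		for range time.Tick(5 * time.Minute) { // placeholder interval
			// In the real design this would run only while this replica is
			// out of rotation (or while its twin is known to be serving).
			runtime.GC() // blocks until the forced collection completes
		}
	}()

	// ... start the RPC server here ...
	select {}
}

The same package covers the "collect less often" half of idea 3: keeping automatic collection on but raising debug.SetGCPercent (or GOGC) to, say, 400 lets the heap grow to roughly 5x the previous live set before the next cycle, instead of 2x at the default of 100, trading memory for fewer cycles.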
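And here is a rough sketch of the direction ideas 3 and 4 point in for the big read-only structure: append the values into one large backing []byte at load time and index them with offset/length pairs. Since neither the key nor the value type of the index map contains a pointer, and the backing array is itself pointer-free, the GC should have very little left to scan here. The names (flatStore, span) are invented for illustration:

package main

// span locates one value inside the shared backing buffer; it contains no
// pointers.
type span struct {
	off uint64 // byte offset into data
	n   uint32 // length in bytes
}

type flatStore struct {
	data  []byte          // one large, append-only backing buffer
	index map[uint64]span // pointer-free keys and values
}

func newFlatStore() *flatStore {
	return &flatStore{index: make(map[uint64]span)}
}

func (s *flatStore) add(key uint64, val []byte) {
	s.index[key] = span{off: uint64(len(s.data)), n: uint32(len(val))}
	s.data = append(s.data, val...)
}

func (s *flatStore) get(key uint64) ([]byte, bool) {
	sp, ok := s.index[key]
	if !ok {
		return nil, false
	}
	return s.data[sp.off : sp.off+uint64(sp.n)], true
}

func main() {
	s := newFlatStore()
	s.add(42, []byte("some payload"))
	if v, ok := s.get(42); ok {
		println(string(v))
	}
}

Entries can't be deleted or resized individually this way, but since these structures are essentially immutable after load that restriction may be acceptable; the map[uint64][]uint64 structure could get the same treatment with a []uint64 backing slice.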
Right now #5 is our (terrible but only feasible-seeming) choice. Appreciate any advice!

Steve Baker

Logs with gc log lines:

pacer: assist ratio=+3.155887e+001 (scan 2669 MB in 10072->10157 MB) workers=2+0
I0719 22:50:31.097707 leaf.go:363] Worker #3 done search for '[redacted]', took 0.017 seconds
I0719 22:50:31.098036 leaf.go:225] Worker #3 starting search for '[redacted]'
I0719 22:50:31.133959 leaf.go:363] Worker #6 done search for '[redacted]', took 0.069 seconds
I0719 22:50:31.134300 leaf.go:225] Worker #6 starting search for '[redacted]'
I0719 22:50:31.142191 leaf.go:363] Worker #9 done search for '[redacted]', took 0.066 seconds
I0719 22:50:31.142275 leaf.go:225] Worker #9 starting search for '[redacted]'
I0719 22:50:31.173921 leaf.go:363] Worker #10 done search for '[redacted]', took 0.098 seconds
I0719 22:50:31.174009 leaf.go:225] Worker #10 starting search for '[redacted]'
I0719 22:50:31.328796 leaf.go:363] Worker #15 done search for '[redacted]', took 0.260 seconds
I0719 22:50:31.330194 leaf.go:225] Worker #15 starting search for '[redacted]'
I0719 22:50:31.395579 leaf.go:363] Worker #11 done search for '[redacted]', took 0.374 seconds
<SNIP>
I0719 22:50:36.317004 leaf.go:363] Worker #5 done search for '[redacted]', took 0.729 seconds
I0719 22:50:36.317190 leaf.go:225] Worker #5 starting search for '[redacted]'
I0719 22:50:36.370191 leaf.go:363] Worker #15 done search for '[redacted]', took 0.192 seconds
I0719 22:50:36.371446 leaf.go:225] Worker #15 starting search for '[redacted]'
I0719 22:50:36.421953 leaf.go:363] Worker #7 done search for '[redacted]', took 0.116 seconds
I0719 22:50:36.422092 leaf.go:225] Worker #7 starting search for '[redacted]'
I0719 22:50:36.570008 leaf.go:363] Worker #12 done search for '[redacted]', took 0.778 seconds
I0719 22:50:36.572970 leaf.go:225] Worker #12 starting search for '[redacted]'
I0719 22:50:36.573571 leaf.go:363] Worker #6 done search for '[redacted]', took 0.710 seconds
I0719 22:50:36.573721 leaf.go:225] Worker #6 starting search for '[redacted]'
pacer: H_m_prev=8875421544 h_t=+1.900000e-001 H_T=10561751637 h_a=+1.983366e-001 H_a=10635742360 h_g=+2.000000e-001 H_g=10650505852 u_a=+9.735779e-001 u_g=+2.500000e-001 W_a=2354135120 goalΔ=+1.000000e-002 actualΔ=+8.336587e-003 u_a/u_g=+3.894312e+000
gc 1072 @4337.737s 39%: 0.37+5688+0.32 ms clock, 2.2+32929/11310/21+1.9 ms cpu, 10072->10143->8661 MB, 10157 MB goal, 8 P