I would add this to 14812 <https://github.com/golang/go/issues/14812>. The report should include the environment variables, hardware, and RAM, and should note whether any environment variables are set to non-default values, such as GOGC (SetGCPercent).
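For completeness, a small snippet that dumps most of that from inside the process (just a sketch; it passes the current value straight back to SetGCPercent so nothing actually changes):

package main

import (
	"fmt"
	"os"
	"runtime"
	"runtime/debug"
)

func main() {
	// SetGCPercent returns the previous setting; restore it immediately so
	// this snippet only reports the value, it doesn't change behavior.
	gcPercent := debug.SetGCPercent(100)
	debug.SetGCPercent(gcPercent)

	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)

	fmt.Printf("go version:  %s\n", runtime.Version())
	fmt.Printf("GOOS/GOARCH: %s/%s\n", runtime.GOOS, runtime.GOARCH)
	fmt.Printf("NumCPU:      %d\n", runtime.NumCPU())
	fmt.Printf("GOMAXPROCS:  %d\n", runtime.GOMAXPROCS(0))
	fmt.Printf("GOGC env:    %q\n", os.Getenv("GOGC"))
	fmt.Printf("GC percent:  %d\n", gcPercent)
	fmt.Printf("heap alloc:  %d MB\n", ms.HeapAlloc>>20)
	fmt.Printf("sys (memory obtained from OS): %d MB\n", ms.Sys>>20)
}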
On Monday, July 24, 2017 at 8:44:10 PM UTC-4, stb...@laserlike.com wrote:

> Hi,
>
> We are experiencing a problem that I believe may be related to issue 14812
> <https://github.com/golang/go/issues/14812>, but I wanted to ask here before
> adding to that case or filing a new issue. Of course, we'd also greatly
> appreciate any advice about how to make our program performant.
>
> Here is what we observe: at Laserlike one of our core user-facing services
> (the "leaf") typically responds to a particular rpc in <400ms. During GC we
> see spikes in latency to >5s on some simple requests. The stop-the-world
> periods are short, so the GC spikes appear to happen at other times.
>
> We have been unable to replicate this in a simple program, but we did run
> our code in a test mode that reproduces it. In our test environment the
> server loads ~10 GB of persistent data (never deleted, so it doesn't really
> need to be GCed), and we ask for 8 processors. We are running go version
> 1.8.3 on kubernetes on a GCP machine of type n1-highmem-64. To create the
> problem we send the server a single request with >500 tasks.
>
> This google drive folder has leaf-logs-full.redacted.txt as well as other
> performance tooling files
> <https://drive.google.com/drive/folders/0BypqYg6r4ebqb0lQNnM5eWtISms?usp=sharing>.
>
> A snippet from that log here shows normal query responses and timing:
>
> I0719 22:50:22.467367 leaf.go:363] Worker #5 done search for '[redacted]', took 0.013 seconds
> I0719 22:50:22.467406 leaf.go:225] Worker #5 starting search for '[redacted]'
> I0719 22:50:22.467486 leaf.go:363] Worker #6 done search for '[redacted]', took 0.001 seconds
> I0719 22:50:22.467520 leaf.go:225] Worker #6 starting search for '[redacted]'
> I0719 22:50:22.468050 leaf.go:363] Worker #9 done search for '[redacted]', took 0.005 seconds
>
> We have observed that if a GC cycle happens to start while serving traffic
> (which is often) AND there is a large amount of time spent in assist, then
> our serving latency skyrockets by 10x. In the log the slowdown commences
> roughly when pacer assist starts at I0719 22:50:31.079283 and then reverts
> to normal latencies shortly after the gc cycle completes at I0719
> 22:50:36.806085.
>
> Below I copy parts of the log where we see latencies of up to 729ms on
> tasks. I also bold the line that shows 32929ms spent on alloc gc assist.
>
> We have captured an attached cpu profile during this time which seems to
> confirm a large amount of time spent in runtime.gcAssistAlloc.func1.
>
> Pardon our ignorance about GC in golang, but our hypothesis about what may
> be going wrong is that our large in-memory data structures are causing gc
> to often go into assist mode, and that for reasons we don't understand
> malloc becomes expensive in that mode. Since we also create ~100k new data
> objects when processing user requests, we are guessing those allocs become
> very slow. Another piece of evidence for this hypothesis is that we have
> another (prototype) implementation of this kind of service that makes more
> use of object pools and doesn't seem to have as much of a slowdown during GC.
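On the object-pool observation: assist work is charged to a goroutine in proportion to how much it allocates while a mark cycle is running, so anything that takes per-request allocations off the heap should shrink that assist time directly. A minimal sketch of the kind of pooling I mean (the result type and sizes here are made up for illustration, not taken from the leaf code):

package main

import (
	"fmt"
	"sync"
)

// result is a stand-in for a per-task scratch buffer; the real type and
// capacity in the leaf server are unknown to me.
type result struct {
	ids []uint64
}

var resultPool = sync.Pool{
	New: func() interface{} { return &result{ids: make([]uint64, 0, 1024)} },
}

func handleTask(task uint64) {
	r := resultPool.Get().(*result)
	r.ids = r.ids[:0] // reuse the backing array instead of allocating a new one

	// ... fill r.ids while processing the task ...
	r.ids = append(r.ids, task)

	fmt.Println(len(r.ids))
	resultPool.Put(r) // return the buffer so later tasks allocate nothing
}

func main() {
	for t := uint64(0); t < 8; t++ {
		handleTask(t)
	}
}

One caveat: sync.Pool is emptied at every GC, so it lowers the allocation rate between cycles but does nothing about the ~10 GB of live heap that sets the amount of mark work.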
> Note on large in-memory data structures:
>
> The principal data structures can be thought of as:
>
> Map[uint64][]byte (about 10M map entries, the slice lengths between 5K and 50K, around ~10G total memory usage)
>
> Map[uint64][]uint64 (about 50M map entries, the slice lengths vary between 10 and 100K in a zipfian distribution, about 3G total memory usage)
>
> These data structures mostly stay as is over the life of the program.
>
> We are trying to figure out how to solve this so would appreciate any
> advice. An engineer on our team wrote up the following ideas, none of which
> are that great:
>
> 1. Figure out a simple way to prevent our goroutines slowing down during
>    GC. I had some hopes LockOSThread() could be made to work, but it didn't
>    seem to help in my experiments. I'm not ruling this solution out entirely,
>    but if it's the write barriers that are the main problem, I don't have
>    much hope.
>
> 2. Run at least 2 replicas of all our servers. Manage their GC cycles
>    ourselves, synchronized so that at most one replica is in GC at any given
>    time. The clients should either send all requests to both replicas (and
>    cancel when one replies), or use some more complicated Kubernetes and
>    client logic so a GCing replica is never sent requests. This is the
>    simplest solution that is likely to work, and doesn't require us to change
>    our habits too much. It just costs more. :)
>
> 3. Refactor our current servers and program future servers with the
>    explicit goal of reducing GC burden. Give our servers more memory and
>    increase SetGCPercent() so that garbage collection happens less frequently.
>    Use simpler data structures with fewer pointers to reduce the length of
>    time the GC cycle lasts when it does run. This isn't a perfect solution,
>    because the GC will still hit us on occasion, but it reduces the harm.
>
> 4. Throw out much of the convenience of Go and write our own
>    allocators/shared pointers that index into a flat byte buffer, making no
>    use of the GC. (Either do this for particular objects, or just write a
>    general solution.) Make GC cycles so infrequent that we can either turn GC
>    off completely without going OOM, or at least run it only once per day,
>    during off hours. This seems very hard, and we'll be fighting with Go a
>    lot.
>
> 5. Implement the core serving data structures of our service in C++,
>    paired off with another server written in Go that manages populating that
>    server's data -- the Go server would manage the data in the C++ server via
>    rpc, and query it over rpc.
>
> 6. The nuclear option: decide that Go is simply not a suitable language
>    for our serving stack. Rewrite major servers in C++.
>
> Right now #5 is our (terrible but only feasible-seeming) choice.
>
> Appreciate any advice!
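On options 3 and 4 above: the ~10M []byte slice headers and ~50M []uint64 slice headers are exactly what the GC has to chase on every cycle, so the "fewer pointers" direction usually pays off well before you need custom allocators. A hedged sketch of what I mean, keeping the payloads in one flat buffer and storing only offsets in the map (the types and names are mine, not from your code); a map whose key and value contain no pointers is not scanned entry by entry, and the single large []byte is one pointer-free object:

package main

import "fmt"

// span locates one record inside a single flat buffer. It contains no
// pointers, so the GC does not have to scan the index's 10M entries.
type span struct {
	off uint64
	len uint32
}

// flatStore replaces Map[uint64][]byte with offsets into one []byte.
type flatStore struct {
	buf   []byte          // all payloads appended back to back
	index map[uint64]span // key -> location in buf
}

func newFlatStore() *flatStore {
	return &flatStore{index: make(map[uint64]span)}
}

func (s *flatStore) put(key uint64, val []byte) {
	s.index[key] = span{off: uint64(len(s.buf)), len: uint32(len(val))}
	s.buf = append(s.buf, val...)
}

func (s *flatStore) get(key uint64) ([]byte, bool) {
	sp, ok := s.index[key]
	if !ok {
		return nil, false
	}
	return s.buf[sp.off : sp.off+uint64(sp.len)], true
}

func main() {
	st := newFlatStore()
	st.put(42, []byte("payload"))
	if v, ok := st.get(42); ok {
		fmt.Printf("%s\n", v)
	}
}

The same trick applies to the []uint64 map: one big []uint64 plus an offset/length index. The GC then sees a handful of large pointer-free objects instead of tens of millions of slice headers, which shortens mark and shrinks assist, and it composes with raising SetGCPercent since the live heap barely changes between cycles.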
> Steve Baker
>
> Logs with gc log lines:
>
> pacer: assist ratio=+3.155887e+001 (scan 2669 MB in 10072->10157 MB) workers=2+0
>
> I0719 22:50:31.097707 leaf.go:363] Worker #3 done search for '[redacted]', took 0.017 seconds
> I0719 22:50:31.098036 leaf.go:225] Worker #3 starting search for '[redacted]'
> I0719 22:50:31.133959 leaf.go:363] Worker #6 done search for '[redacted]', took 0.069 seconds
> I0719 22:50:31.134300 leaf.go:225] Worker #6 starting search for '[redacted]'
> I0719 22:50:31.142191 leaf.go:363] Worker #9 done search for '[redacted]', took 0.066 seconds
> I0719 22:50:31.142275 leaf.go:225] Worker #9 starting search for '[redacted]'
> I0719 22:50:31.173921 leaf.go:363] Worker #10 done search for '[redacted]', took 0.098 seconds
> I0719 22:50:31.174009 leaf.go:225] Worker #10 starting search for '[redacted]'
> I0719 22:50:31.328796 leaf.go:363] Worker #15 done search for '[redacted]', took 0.260 seconds
> I0719 22:50:31.330194 leaf.go:225] Worker #15 starting search for '[redacted]'
> I0719 22:50:31.395579 leaf.go:363] Worker #11 done search for '[redacted]', took 0.374 seconds
>
> <SNIP>
>
> I0719 22:50:36.317004 leaf.go:363] Worker #5 done search for '[redacted]', took 0.729 seconds
> I0719 22:50:36.317190 leaf.go:225] Worker #5 starting search for '[redacted]'
> I0719 22:50:36.370191 leaf.go:363] Worker #15 done search for '[redacted]', took 0.192 seconds
> I0719 22:50:36.371446 leaf.go:225] Worker #15 starting search for '[redacted]'
> I0719 22:50:36.421953 leaf.go:363] Worker #7 done search for '[redacted]', took 0.116 seconds
> I0719 22:50:36.422092 leaf.go:225] Worker #7 starting search for '[redacted]'
> I0719 22:50:36.570008 leaf.go:363] Worker #12 done search for '[redacted]', took 0.778 seconds
> I0719 22:50:36.572970 leaf.go:225] Worker #12 starting search for '[redacted]'
> I0719 22:50:36.573571 leaf.go:363] Worker #6 done search for '[redacted]', took 0.710 seconds
> I0719 22:50:36.573721 leaf.go:225] Worker #6 starting search for '[redacted]'
>
> pacer: H_m_prev=8875421544 h_t=+1.900000e-001 H_T=10561751637 h_a=+1.983366e-001 H_a=10635742360 h_g=+2.000000e-001 H_g=10650505852 u_a=+9.735779e-001 u_g=+2.500000e-001 W_a=2354135120 goalΔ=+1.000000e-002 actualΔ=+8.336587e-003 u_a/u_g=+3.894312e+000
>
> gc 1072 @4337.737s 39%: 0.37+5688+0.32 ms clock, 2.2+32929/11310/21+1.9 ms cpu, 10072->10143->8661 MB, 10157 MB goal, 8 P
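One more data point worth collecting before reaching for option #5: an execution trace taken across one of these GC cycles. go tool trace breaks GC work out per goroutine, including mark assist, which would confirm (or rule out) that the 0.7s tasks are the ones paying the assist tax rather than being blocked elsewhere. A minimal sketch, assuming you can wrap the serving window in trace.Start/trace.Stop (the file name and the 30s window are arbitrary):

package main

import (
	"log"
	"os"
	"runtime/trace"
	"time"
)

func main() {
	f, err := os.Create("leaf.trace")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Trace a window long enough to cover one full GC cycle plus the
	// slow requests around it.
	if err := trace.Start(f); err != nil {
		log.Fatal(err)
	}
	time.Sleep(30 * time.Second) // ...or trigger the >500-task request here instead
	trace.Stop()

	// Inspect with: go tool trace leaf.trace
	// The per-goroutine mark assist slices show where mutators are paying
	// for their allocations during the cycle.
}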