One thing you can try is to increase the number of locks so that the goroutines aren't all blocked on a single lock. For example, you can partition your stores: https://play.golang.org/p/y16yr57KcQ
type PartitionedStore struct {
    stores []Store
}

func NewPartitionedStore(sz int) *PartitionedStore {
    p := &PartitionedStore{
        stores: make([]Store, sz),
    }
    for i := 0; i < sz; i++ {
        p.stores[i] = *(NewStore())
    }
    return p
}

func (p *PartitionedStore) addSample(key string, value int64) {
    p.getStore(key).addSample(key, value)
}

func (p *PartitionedStore) getStore(key string) *Store {
    h := fnv.New32a() // might want to use a faster algorithm here
    io.WriteString(h, key)
    idx := h.Sum32() % uint32(len(p.stores))
    return &p.stores[idx]
}

func main() {
    p := NewPartitionedStore(32)
    p.addSample("test", 15)
}

So in this case there are 32 pre-computed locks (no need for a mutex around a fixed-size array that never changes), and if the keys are uniformly distributed you reduce the chance that two goroutines hit the same mutex at the same time.

(This approach is language agnostic, so it may help the Java program too...)

On Sunday, October 2, 2016 at 10:17:53 AM UTC-4, toefel18 wrote:
>
> Hi,
>
> I've written a small library (https://github.com/toefel18/go-patan) that
> stores counters and collects statistics (running min/max/average/standard
> deviation) of a program during runtime. There is a lock-based
> implementation and a channel-based implementation. I've written this
> library before in Java as well: https://github.com/toefel18/patan. The
> Java version with an equivalent implementation is much faster than both
> the channel and locking implementations in Go, and I really don't
> understand why.
>
> This program has a store that holds the data, and a sync.Mutex that
> guards concurrent access on reads and writes.
> This is a snippet of the locking-based implementation:
>
> type Store struct {
>     durations map[string]*Distribution
>     counters  map[string]int64
>     samples   map[string]*Distribution
>
>     lock *sync.Mutex
> }
>
> func (store *Store) addSample(key string, value int64) {
>     store.addToStore(store.samples, key, value)
> }
>
> func (store *Store) addDuration(key string, value int64) {
>     store.addToStore(store.durations, key, value)
> }
>
> func (store *Store) addToStore(destination map[string]*Distribution, key string, value int64) {
>     store.lock.Lock()
>     defer store.lock.Unlock()
>     distribution, exists := destination[key]
>     if !exists {
>         distribution = NewDistribution()
>         destination[key] = distribution
>     }
>     distribution.addSample(value)
> }
>
> Now, when I benchmark this Go code, I get the following results (see
> gist: benchmark code
> <https://gist.github.com/toefel18/96edac8f57f9ad8a4f9a86d83e726aec>):
>
> 10 threads with 20000 items took 133 millis
> 100 threads with 20000 items took 1809 millis
> 1000 threads with 20000 items took 17576 millis
> 10 threads with 200000 items took 1228 millis
> 100 threads with 200000 items took 17900 millis
>
> When I benchmark the Java code, there are much better results (see gist:
> java benchmark code
> <https://gist.github.com/toefel18/9def55a8c3c53c79a4488c29c66f31e5>):
>
> 10 threads with 20000 items takes 89 millis
> 100 threads with 20000 items takes 265 millis
> 1000 threads with 20000 items takes 2888 millis
> 10 threads with 200000 items takes 311 millis
> 100 threads with 200000 items takes 3067 millis
>
> I have profiled the Go code and created a call graph
> <https://gist.githubusercontent.com/toefel18/96edac8f57f9ad8a4f9a86d83e726aec/raw/c1ff6452abaefe49c3c688b3d07e4574cbe77264/call-graph.png>.
> I interpret this as follows: Go spends 0.31 and 0.25 seconds in my
> methods, and pretty much all the rest in sync.(*Mutex).Lock() and
> sync.(*Mutex).Unlock().
>
> The top20 output of the profiler:
>
> (pprof) top20
> 59110ms of 73890ms total (80.00%)
> Dropped 22 nodes (cum <= 369.45ms)
> Showing top 20 nodes out of 65 (cum >= 50220ms)
>       flat  flat%   sum%        cum   cum%
>     8900ms 12.04% 12.04%     8900ms 12.04%  runtime.futex
>     7270ms  9.84% 21.88%     7270ms  9.84%  runtime/internal/atomic.Xchg
>     7020ms  9.50% 31.38%     7020ms  9.50%  runtime.procyield
>     4560ms  6.17% 37.56%     4560ms  6.17%  sync/atomic.CompareAndSwapUint32
>     4400ms  5.95% 43.51%     4400ms  5.95%  runtime/internal/atomic.Xadd
>     4210ms  5.70% 49.21%    22040ms 29.83%  runtime.lock
>     3650ms  4.94% 54.15%     3650ms  4.94%  runtime/internal/atomic.Cas
>     3260ms  4.41% 58.56%     3260ms  4.41%  runtime/internal/atomic.Load
>     2220ms  3.00% 61.56%    22810ms 30.87%  sync.(*Mutex).Lock
>     1870ms  2.53% 64.10%     1870ms  2.53%  runtime.osyield
>     1540ms  2.08% 66.18%    16740ms 22.66%  runtime.findrunnable
>     1430ms  1.94% 68.11%     1430ms  1.94%  runtime.freedefer
>     1400ms  1.89% 70.01%     1400ms  1.89%  sync/atomic.AddUint32
>     1250ms  1.69% 71.70%     1250ms  1.69%  github.com/toefel18/go-patan/statistics/lockbased.(*Distribution).addSample
>     1240ms  1.68% 73.38%     3140ms  4.25%  runtime.deferreturn
>     1070ms  1.45% 74.83%     6520ms  8.82%  runtime.systemstack
>     1010ms  1.37% 76.19%     1010ms  1.37%  runtime.newdefer
>     1000ms  1.35% 77.55%     1000ms  1.35%  runtime.mapaccess1_faststr
>      950ms  1.29% 78.83%    15660ms 21.19%  runtime.semacquire
>      860ms  1.16% 80.00%    50220ms 67.97%  main.Benchmrk.func1
>
> I would really like to understand why locking in Go is so much slower
> than in Java. I initially wrote this program using channels, but that was
> much slower than locking. Can somebody please help me out?

--
You received this message because you are subscribed to the Google Groups "golang-nuts" group.
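To make the sharding idea above concrete as a self-contained program: below is a minimal sketch of the same technique, with the `ShardedStore`, `shard`, and `Distribution` types as illustrative stand-ins (they are not go-patan's actual API). It also inlines the FNV-1a hash by hand, which sidesteps the per-call allocation of `fnv.New32a()` mentioned in the "faster algorithm" comment, and each shard carries its own sync.Mutex so goroutines with different keys mostly take different locks.

```go
package main

import (
	"fmt"
	"sync"
)

// Distribution is a minimal stand-in for go-patan's statistics type,
// just enough to demonstrate sharded locking.
type Distribution struct {
	count int64
	sum   int64
}

func (d *Distribution) addSample(v int64) {
	d.count++
	d.sum += v
}

// shard pairs its own mutex with the data it protects, so contention
// is spread across shards instead of a single global lock.
type shard struct {
	mu      sync.Mutex
	samples map[string]*Distribution
}

type ShardedStore struct {
	shards []shard
}

func NewShardedStore(n int) *ShardedStore {
	s := &ShardedStore{shards: make([]shard, n)}
	for i := range s.shards {
		s.shards[i].samples = make(map[string]*Distribution)
	}
	return s
}

// fnv32a is FNV-1a hand-inlined over the key bytes, avoiding the heap
// allocation fnv.New32a() performs on every call.
func fnv32a(key string) uint32 {
	h := uint32(2166136261) // FNV offset basis
	for i := 0; i < len(key); i++ {
		h ^= uint32(key[i])
		h *= 16777619 // FNV prime
	}
	return h
}

func (s *ShardedStore) addSample(key string, value int64) {
	sh := &s.shards[fnv32a(key)%uint32(len(s.shards))]
	sh.mu.Lock()
	defer sh.mu.Unlock()
	d, ok := sh.samples[key]
	if !ok {
		d = &Distribution{}
		sh.samples[key] = d
	}
	d.addSample(value)
}

func main() {
	store := NewShardedStore(32)
	var wg sync.WaitGroup
	for g := 0; g < 100; g++ {
		wg.Add(1)
		go func(g int) {
			defer wg.Done()
			key := fmt.Sprintf("key-%d", g%8) // 8 hot keys spread over 32 shards
			for i := 0; i < 1000; i++ {
				store.addSample(key, int64(i))
			}
		}(g)
	}
	wg.Wait()

	// Tally total samples across all shards: 100 goroutines x 1000 each.
	total := int64(0)
	for i := range store.shards {
		for _, d := range store.shards[i].samples {
			total += d.count
		}
	}
	fmt.Println(total) // prints 100000
}
```

Whether this helps depends on the key distribution: if all goroutines hammer one key they still serialize on one shard, so the win comes from keys spreading across shards.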