> This program has a store that holds the data, and a sync.Mutex that guards
> concurrent access on reads and writes. This is a snippet of the locking
> based implementation:
>
>     type Store struct {
>         durations map[string]*Distribution
>         counters  map[string]int64
>         samples   map[string]*Distribution
>
>         lock *sync.Mutex
>     }

If I were to write this, I would have one Store per goroutine. Then, at
intervals, you send this Store to a central location (over a channel) and
tally up the counts. In turn, you avoid locking in the tight loops, since
each goroutine owns the Store it operates on until you make a copy of it
and send it along.

Another approach, which is a bit harder to pull off in Go, is to have a
mutex/Store per core. Read operations then take all the mutexes for
reading, but write operations will almost never compete on the lock, so it
becomes essentially free of contention.
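To make the first of those two schemes concrete, here is a rough sketch of
a per-goroutine Store with a central aggregator. The names (localStore,
worker, aggregator) and the flush-every-N-updates policy are mine, not
taken from the program under discussion; a real version would more likely
flush on a timer, as described above.

    package main

    import (
        "fmt"
        "sync"
    )

    // localStore is a lock-free stand-in for the Store above. It is only
    // ever touched by the goroutine that owns it, so no mutex is needed.
    type localStore struct {
        counters map[string]int64
    }

    func newLocalStore() *localStore {
        return &localStore{counters: make(map[string]int64)}
    }

    // worker does its bookkeeping without any locking and hands a snapshot
    // to the aggregator every flushEvery updates.
    func worker(n int, out chan<- map[string]int64) {
        const flushEvery = 1000
        s := newLocalStore()
        for i := 0; i < n; i++ {
            s.counters["requests"]++ // hot path: goroutine-local, no lock
            if (i+1)%flushEvery == 0 {
                out <- s.counters   // hand off ownership of the map ...
                s = newLocalStore() // ... and keep counting in a fresh one
            }
        }
        out <- s.counters // final flush
    }

    // aggregator is the single central location that tallies the snapshots.
    func aggregator(in <-chan map[string]int64, result chan<- map[string]int64) {
        total := make(map[string]int64)
        for snap := range in {
            for k, v := range snap {
                total[k] += v
            }
        }
        result <- total
    }

    func main() {
        snapshots := make(chan map[string]int64)
        result := make(chan map[string]int64, 1)

        go aggregator(snapshots, result)

        var wg sync.WaitGroup
        for i := 0; i < 4; i++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                worker(100000, snapshots)
            }()
        }
        wg.Wait()
        close(snapshots)

        fmt.Println(<-result) // map[requests:400000]
    }

The only synchronization left is the channel send once per flush interval;
the tight loop itself never touches a shared structure.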
The reason this works is that your counters work as CRDT P-counters, and
max/min are CRDTs on their own, so you don't have to synchronize at
all[0]. Also, if you keep the count, the sum and the sum of squares, it is
easy to compute the mean and the std. dev. of the data set, and all of
these form CRDTs [1] (a small sketch of this is at the end of this mail).
The key to fast concurrent programming which achieves good parallelism is
to avoid having to synchronize at all. (Aside: I have used both of the
above schemes with success in Erlang, so I know they work in practice as
well as in theory.)

> I would really like to understand why Locking in Go is so much slower
> than in Java. I've initially written this program using channels, but
> that was much slower than locking. Can somebody please help me out?

Java's compiler employs a number of optimizations w.r.t. locking:

* Escape analysis and lock elision: If we can prove data is thread-local
  and never escapes the thread, then we can elide the lock and never take
  it.

* Lock coarsening: Suppose we take and release the same lock multiple
  times in a row, say because you unroll the inner loop of your benchmark
  a bit. Then you can take the lock once, update several times and release
  the lock. This is called coarsening the lock. If the lock is heavily
  contended, this may be faster, especially if other optimizations allow
  us to strength-reduce the updates and fold several of them into a single
  update, which is exactly what we would do once we have unrolled the loop
  4-8 times.

There is also the ability to exploit modern CPUs' hardware transactional
memory: if you have a coarse lock over a large array, and most updates
affect different indexes in the array, then you can run a transaction
optimistically under the assumption that threads will write to different
slots. On a collision, you restart the transaction, this time with normal
mutexes. This is sometimes called hardware lock elision, but the actual
state of it on modern CPUs eludes me. The extensions were there, but I
think they were disabled by microcode since there were bugs in them (at
least on the Haswell generation of Intel CPUs). Also, I don't think Java
has this yet, but I might be mistaken.

[0] CRDT: Commutative/Convergent Replicated Data Type.

[1] Consider just using https://godoc.org/github.com/codahale/hdrhistogram
by Coda Hale. HDR Histogram is a nice data structure, invented by Gil
Tene, which combines the ideas of Flajolet's HyperLogLog with the
structure of floating point representations. Histogram updates are
measured in the lower nanoseconds.
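To illustrate the count/sum/sum-of-squares remark above: a summary of that
shape merges by plain addition, so per-goroutine copies can be combined at
the central location in any order. This is a minimal sketch; the summary
type and its methods are my own naming, not from the program being
discussed.

    package main

    import (
        "fmt"
        "math"
    )

    // summary keeps just enough state to recover the mean and the standard
    // deviation later: the count, the sum and the sum of squares.
    type summary struct {
        count int64
        sum   float64
        sumSq float64
    }

    func (s *summary) Add(x float64) {
        s.count++
        s.sum += x
        s.sumSq += x * x
    }

    // Merge folds another summary into this one. The operation is just
    // addition, hence commutative and associative, which is what lets the
    // per-goroutine summaries converge regardless of merge order.
    func (s *summary) Merge(o summary) {
        s.count += o.count
        s.sum += o.sum
        s.sumSq += o.sumSq
    }

    func (s *summary) Mean() float64 {
        return s.sum / float64(s.count)
    }

    // StdDev uses the population form sqrt(E[x^2] - E[x]^2). The naive
    // sum-of-squares can lose floating point precision on large data sets;
    // it is kept this simple only to mirror the description above.
    func (s *summary) StdDev() float64 {
        m := s.Mean()
        return math.Sqrt(s.sumSq/float64(s.count) - m*m)
    }

    func main() {
        // Two goroutine-local summaries, merged centrally.
        var a, b summary
        for _, x := range []float64{1, 2, 3} {
            a.Add(x)
        }
        for _, x := range []float64{4, 5, 6} {
            b.Add(x)
        }
        a.Merge(b)
        fmt.Printf("n=%d mean=%.2f stddev=%.2f\n", a.count, a.Mean(), a.StdDev())
        // n=6 mean=3.50 stddev=1.71
    }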