>
>
> This program has a store that holds the data, and a sync.Mutex that guards
> concurrent access on reads and writes. This is a snippet of the locking
> based implementation:
>
> type Store struct {
>    durations map[string]*Distribution
>    counters  map[string]int64
>    samples   map[string]*Distribution
>
>    lock *sync.Mutex
> }
>
If I were to write this, I would have one Store per goroutine. Then at
intervals, you send this Store to a central location (over a channel) and
tally up the counts. In turn, you avoid locking in the tight loops, since
each goroutine owns the Store it operates on until you make a copy of it
and send it along. Another approach, which is a bit harder to pull off in
Go, is to have a mutex/Store per core. Read operations then take all the
mutexes, but write operations will almost never compete on their local
lock, so the write path becomes essentially free of contention.
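
To make the first approach concrete, here is a minimal sketch. The Store is
reduced to just a counter map, and each worker flushes once at the end rather
than on a timer, to keep the example short; the names and numbers are my own,
not the original program's:

    package main

    import (
        "fmt"
        "sync"
    )

    // Store is reduced to just counters for this sketch.
    type Store struct {
        counters map[string]int64
    }

    func newStore() *Store { return &Store{counters: make(map[string]int64)} }

    func main() {
        const workers = 4
        out := make(chan *Store, workers)

        var wg sync.WaitGroup
        for i := 0; i < workers; i++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                s := newStore() // owned by this goroutine alone
                for n := 0; n < 1000000; n++ {
                    s.counters["requests"]++ // hot path: no lock taken
                }
                out <- s // hand the whole Store over when done
            }()
        }

        // Close the channel once every worker has flushed its Store.
        go func() { wg.Wait(); close(out) }()

        // Tally centrally; only this goroutine touches `total`,
        // so no lock is needed here either.
        total := make(map[string]int64)
        for s := range out {
            for k, v := range s.counters {
                total[k] += v
            }
        }
        fmt.Println(total["requests"]) // 4000000
    }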

The reason this works is that your counters behave as CRDT P-counters, and
max/min are CRDTs on their own, so you don't have to synchronize at all [0].
Also, if you keep the count, the sum, and the sum of squares, it is easy to
compute the mean and the std. dev. of the data set; all of these form CRDTs [1].
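
What makes this safe is that the partial results merge with simple,
commutative operations. A sketch of such a mergeable distribution (the field
and method names are my own, not the original program's):

    package stats

    import "math"

    // Dist accumulates count, sum and sum of squares (plus min/max)
    // for one goroutine's samples.
    type Dist struct {
        Count      int64
        Sum, SumSq float64
        Min, Max   float64
    }

    func (d *Dist) Add(x float64) {
        if d.Count == 0 || x < d.Min {
            d.Min = x
        }
        if d.Count == 0 || x > d.Max {
            d.Max = x
        }
        d.Count++
        d.Sum += x
        d.SumSq += x * x
    }

    // Merge folds another partial Dist into d. Addition commutes and
    // min/max are idempotent, so partial results can be combined in
    // any order without coordination.
    func (d *Dist) Merge(o Dist) {
        if o.Count == 0 {
            return
        }
        if d.Count == 0 || o.Min < d.Min {
            d.Min = o.Min
        }
        if d.Count == 0 || o.Max > d.Max {
            d.Max = o.Max
        }
        d.Count += o.Count
        d.Sum += o.Sum
        d.SumSq += o.SumSq
    }

    // Mean and (population) standard deviation fall out of the sums:
    // mean = sum/n, variance = sumsq/n - mean^2.
    func (d *Dist) Mean() float64 { return d.Sum / float64(d.Count) }

    func (d *Dist) StdDev() float64 {
        m := d.Mean()
        return math.Sqrt(d.SumSq/float64(d.Count) - m*m)
    }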

The key to fast concurrent programming that achieves good parallelism is
to avoid having to synchronize at all.


(Aside: I have used both of the above schemes with success in Erlang, so I
know they work in practice as well as in theory.)

> I would really like to understand why Locking in Go is so much slower than in 
> Java. I've initially written this program using channels, but that was much 
> slower than locking. Can somebody please help me out?
>
Java's compiler employs a number of optimizations w.r.t. locking:

* Escape analysis and lock elision: If we can prove data is thread-local
and never escapes the thread, then we can elide the lock and never take it.

* Lock coarsening: Suppose we take and release the same lock multiple times
in a row, say because the inner loop of your benchmark has been unrolled a
bit. Then you can take the lock once, perform several updates and release
the lock. This is called coarsening the lock. If the lock is heavily
contended, this may be faster, especially if other optimizations allow us
to strength-reduce the updates and fold several of them into a single
update, which is exactly what we would do once we have unrolled the loop
4-8 times. (A hand-rolled Go version of this idea is sketched after this
list.)
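
Go's compiler will not coarsen locks for you, but you can do it by hand in a
hot loop. A small sketch, with made-up names, contrasting the fine-grained
version with the coarsened and strength-reduced ones:

    package metrics

    import "sync"

    type Counters struct {
        mu   sync.Mutex
        vals map[string]int64
    }

    // Fine-grained: take and release the lock on every single update.
    func (c *Counters) IncEach(keys []string) {
        for _, k := range keys {
            c.mu.Lock()
            c.vals[k]++
            c.mu.Unlock()
        }
    }

    // Coarsened: take the lock once for the whole batch. Fewer
    // lock/unlock operations, at the price of holding the lock a
    // little longer.
    func (c *Counters) IncBatch(keys []string) {
        c.mu.Lock()
        for _, k := range keys {
            c.vals[k]++
        }
        c.mu.Unlock()
    }

    // Strength-reduced: n increments of the same key fold into a
    // single addition under one lock acquisition.
    func (c *Counters) IncN(key string, n int64) {
        c.mu.Lock()
        c.vals[key] += n
        c.mu.Unlock()
    }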

There is also the ability to exploit modern CPUs' hardware transactional
memory: if you have a coarse lock over a large array, and most updates
affect different indexes in the array, then you can run a transaction
optimistically under the assumption that threads will write to different
slots. On a collision, you restart the transaction, this time taking the
normal mutex. This is sometimes called hardware lock elision, but the
actual state of it on modern CPUs eludes me. The extensions were there, but
I think they were disabled by microcode since there were bugs in them (at
least on the Haswell generation of Intel CPUs). Also, I don't think Java
has this yet, but I might be mistaken.

[0] CRDT: Commutative/Convergent Replicated Data Type.
[1] Consider just using https://godoc.org/github.com/codahale/hdrhistogram by
Coda Hale. HDR Histogram is a nice data structure, invented by Gil Tene,
which combines the ideas of Flajolet's HyperLogLog with the structure of
floating point representations. Histogram updates are measured in the lower
nanoseconds.
