50 ns is approximately the cost of a cross-CPU L2 cache miss. Any time you 
have tight cross-CPU communication, you're going to incur that cost no 
matter how the communication is performed — whether it's via a sync.Mutex, 
a channel, or an atomic write.

The way to eliminate that cost is to eliminate or batch up the cross-core 
communication, so that you incur less than one cross-core cache miss per 
operation (compare https://youtu.be/C1EtfDnsdDs).

On Monday, January 18, 2021 at 10:11:08 PM UTC-5 Peter Wilson wrote:

> Robert
>
> I was interested in channel performance only because it's the 'Go idiom'. 
> Channel communication is relatively complex, and so gives a useful upper 
> bound on the cost of inter-goroutine synchronisation.
> It's reassuring to see that 50 ns per operation is what my machinery 
> delivers. Simpler operations with atomics will therefore be no slower, and 
> quite possibly quicker. This is good, and checkable by experiment.
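>
> That experiment can be sketched with nothing but an atomic flag. The 
> following is a minimal, hypothetical spin-wait ping-pong (not code from 
> this thread; it burns CPU while spinning, but it bounds the raw 
> synchronisation cost):
>
>     package main
>
>     import (
>         "fmt"
>         "sync/atomic"
>         "time"
>     )
>
>     func main() {
>         const L = 1_000_000
>         var flag atomic.Int64 // 0: main's turn, 1: worker's turn
>
>         go func() {
>             for i := 0; i < L; i++ {
>                 for flag.Load() != 1 {
>                     // spin until main hands off
>                 }
>                 flag.Store(0) // hand back to main
>             }
>         }()
>
>         start := time.Now()
>         for i := 0; i < L; i++ {
>             flag.Store(1) // hand off to the worker
>             for flag.Load() != 0 {
>                 // spin until the worker replies
>             }
>         }
>         fmt.Println(time.Since(start)/(2*L), "per handoff")
>     }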
>
> Once the sync costs are settled, I suspect it's the load-balancing that 
> will be the rate limiter. That's more of a concern. But first things first.
>
> On Monday, January 18, 2021 at 8:59:15 PM UTC-6 ren...@ix.netcom.com 
> wrote:
>
>> Channels are built on locks, which (without biased locking) can lead to 
>> delays as goroutines are scheduled and blocked, especially under 
>> contention. 
>>
>> github.com/robaho/go-concurrency-test might illuminate some other 
>> options to try. 
>>
>> On Jan 18, 2021, at 8:13 PM, Pete Wilson <peter....@bsc.es> wrote:
>>
>> No need to beg forgiveness.
>>
>>
>> For me, the issue is the synchronisation, not how the collection is 
>> specified.
>> You’re right; there’s no reason why a slice (or an array) couldn’t be 
>> used to define the collection. In fact, that’s the plan.
>> But the synchronisation is still a pain.
>>
>> FWIW I built a channel-based experiment.
>>
>> I have N worker goroutines plus main.
>> I have an array of N channels, toworkers; each worker gets a pointer to 
>> its own channel.
>> I have an N-buffered channel, fromthreads.
>>
>> Each worker waits for an input on its own channel;
>> when it gets one, it sends something on the buffered channel back to main.
>> .. each worker loops L times
>>
>> Main sends to each worker on that worker's own channel,
>> then waits for N responses on the communal buffered channel.
>> .. main loops L times
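>>
>> In code, the setup reads roughly like this (a minimal sketch 
>> reconstructed from the description above; N, L, and the int payload are 
>> placeholders):
>>
>>     package main
>>
>>     import (
>>         "fmt"
>>         "time"
>>     )
>>
>>     const (
>>         N = 12     // workers, e.g. one per virtual core
>>         L = 100000 // rounds
>>     )
>>
>>     func main() {
>>         toworkers := make([]chan int, N)
>>         fromthreads := make(chan int, N) // buffered: workers never block
>>
>>         for i := 0; i < N; i++ {
>>             toworkers[i] = make(chan int)
>>             go func(in <-chan int) {
>>                 for v := range in {
>>                     fromthreads <- v // reply to main
>>                 }
>>             }(toworkers[i])
>>         }
>>
>>         start := time.Now()
>>         for l := 0; l < L; l++ {
>>             for i := 0; i < N; i++ {
>>                 toworkers[i] <- l // one send per worker
>>             }
>>             for i := 0; i < N; i++ {
>>                 <-fromthreads // collect all N replies
>>             }
>>         }
>>         fmt.Println(time.Since(start)/(L*N), "per send/receive pair")
>>     }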
>>
>> The end result is that
>> - at one goroutine per core, a message send or receive costs 100-200 ns
>> - but the 6-core, 12-virtual-core processor only runs at about 2.5 cores' 
>> worth of load
>> - presumably, the Go runtime has too much overhead timeslicing its threads 
>> when there's only one goroutine per thread or core
>>
>> If we put 32 goroutines per core, so the Go runtime is busily engaged in 
>> multiplexing goroutines onto each thread, the cost per communication drops 
>> to around 50-60 ns.
>>
>> For my purposes, that's far too many threads (goroutines), but it does 
>> suggest that synchronisation has an upper-limit cost of about 50 ns.
>> So tomorrow we try the custom sync approach. Results will be published.
>>
>> Meanwhile, thanks to all for thoughts and advice
>>
>> — P
>>
>>
