That’s exactly the plan.

The idea is to simulate the workload of perhaps a complete chiplet. That might 
be (assuming no SIMD in the processors, to keep the example light) 2K cores. 
Each worklet (worklet = the work done for one simulated core in a phase) is 
perhaps 25-50 nsec.

The simplest mechanism is probably that on finishing its work, every worker 
sends a message to main; when main has received all the messages, it sends a 
message back to each of the workers. Nice and simple. But it seems that a 
channel communication is of the order of 100 ns, so I'm eating 200 nsec per 
phase change in each worker.

With 10 executing cores and 2K simulated cores, each executing core gets to do 
around 200 worklets per phase. At 25 ns per worklet that's 5 microseconds of 
work per worker, and losing 200 nsec out of that will still let the thing 
scale reasonably to some useful number of cores.

But tools are more useful if they're relatively broad-spectrum. If I want to 
have 20 worklets per core per phase (running a simulation on a subset of the 
system, to gain simulation speed), I'm now spending ~200 ns of synchronization 
against 500 ns of work, which is not a hugely scalable ratio at all. It would 
probably run slower than a standard single-core sequential implementation 
regardless of the number of cores, so not a Big Win.

Were a runtime.Unique() function to exist (a hypothetical scheduler call that, 
for some number of goroutines and cores, lets a goroutine declare that it 
should be the sole workload on a core, limited to a fairly large subset of the 
available cores), I could spin-loop on an atomic load, emulating 
waitgroup/barrier behaviour without any scheduler involvement and with times 
closer to the 10 ns level (when worklet path lengths were well balanced).

I’d also welcome the news that under useful circumstances channel communication 
is only (say) 20ns. That’d simplify things beauteously. (All ns measured on a 
~3GHz Core i7 of 2018)

[Historical Note: When I were a young lad, I wrote quite a bit of stuff in 
occam, so channel-y stuff is lapsed second nature - channels just work - and 
all this building barrier stuff is terribly unnatural. So my instincts are 
(were?) good, but platform performance doesn’t seem to want to play along]

> On Jan 17, 2021, at 9:21 AM, Robert Engels <reng...@ix.netcom.com> wrote:
> 
> If there is very little work to be done - then you have N “threads” do M 
> partitioned work. If M is 10x N you’ve decimated the synchronization cost. 
