That’s exactly the plan. The idea is to simulate the workload of a complete chiplet, which might be (assuming no SIMD in the processors, to keep the example light) 2K cores. Each worklet (the work done for one simulated core per phase) is perhaps 25-50 ns.
The simplest mechanism is probably that, on finishing its work, every worker sends a message to main; when main has received all the messages, it sends a message back to each of the workers. Nice and simple. But it seems that a channel communication is of the order of 100 ns, so I’m eating 200 ns per phase change in each worker.

With 10 executing cores and 2K simulated cores, each executing core runs around 200 worklets per phase. At 25 ns per worklet that’s 5 microseconds of work per worker, and losing 200 ns out of that will still let the thing scale reasonably to some useful number of cores.

But tools are more useful if they’re relatively broad-spectrum. If I want only 20 worklets per executing core per phase (running a simulation on a subset of the system, to gain simulation speed), I’m now spending ~200 ns out of 500 ns of work, which is not a hugely scalable number at all. It would probably run slower than a standard single-core sequential implementation regardless of the number of cores, so not a Big Win.

Were a runtime.Unique() function to exist (a hypothetical scheduler call that, for some number of goroutines and cores, allows a goroutine to declare that it should be the sole workload for a core, limited to a fairly large subset of the available cores), I could spinloop on an atomic load, emulating waitgroup/barrier behaviour without any scheduler involvement and with times closer to the 10 ns level (when worklet path lengths were well balanced).

I’d also welcome the news that under useful circumstances channel communication is only (say) 20 ns. That’d simplify things beauteously. (All ns measured on a ~3 GHz Core i7 of 2018.)

[Historical Note: When I were a young lad, I wrote quite a bit of stuff in occam, so channel-y stuff is lapsed second nature - channels just work - and all this building barrier stuff is terribly unnatural. So my instincts are (were?)
good, but platform performance doesn’t seem to want to play along]

> On Jan 17, 2021, at 9:21 AM, Robert Engels <reng...@ix.netcom.com> wrote:
>
> If there is very little work to be done - then you have N “threads” do M
> partitioned work. If M is 10x N you’ve decimated the synchronization cost.