Re: [go-nuts] Advice, please

2021-01-18 Thread Bakul Shah
On Jan 18, 2021, at 4:39 AM, Kevin Chadwick wrote: > On 1/17/21 4:46 PM, Bakul Shah wrote: >> I’d be tempted to just use C for this. That is, generate C code from a register-level description of your N simulation cores and run that. That is more or less what “cycle based” Verilog

Re: [go-nuts] Advice, please

2021-01-18 Thread Kevin Chadwick
On 1/17/21 4:46 PM, Bakul Shah wrote: > I’d be tempted to just use C for this. That is, generate C code from a register-level description of your N simulation cores and run that. That is more or less what “cycle based” Verilog simulators (used to) do. The code gen can also split it M

Re: [go-nuts] Advice, please

2021-01-17 Thread Pete Wilson
Yes, that is possible. The simulated cores are already generated functions in C. It’s my experience that if you can leverage an existing concurrency framework then life is better for everyone; go’s is fairly robust, so this is an experiment to see how close I can get. A real simulation system has

Re: [go-nuts] Advice, please

2021-01-17 Thread Bakul Shah
I’d be tempted to just use C for this. That is, generate C code from a register-level description of your N simulation cores and run that. That is more or less what “cycle based” Verilog simulators (used to) do. The code gen can also split it M ways to run on M physical cores. You can also gener

Re: [go-nuts] Advice, please

2021-01-17 Thread Pete Wilson
I’d observed that, experimentally, but I don’t think it’s guaranteed behaviour :-( > On Jan 17, 2021, at 10:07 AM, Robert Engels wrote: > If you use GOMAXPROCS as a subset of physical cores and have the same number of routines you can busy wait in a similar fashion.

Re: [go-nuts] Advice, please

2021-01-17 Thread David Riley
This particular problem is akin to the way GPUs work; in GPU computing, you generally marshal all your inputs into one buffer and have the outputs from a single kernel written to an output buffer (obviously there can be more on either side). The execution overhead of a single invocation is larg

Re: [go-nuts] Advice, please

2021-01-17 Thread Robert Engels
If you use GOMAXPROCS as a subset of physical cores and have the same number of routines, you can busy wait in a similar fashion. > On Jan 17, 2021, at 9:50 AM, Pete Wilson wrote: > That’s exactly the plan. The idea is to simulate perhaps the workload of a complete chiplet. That mig
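A minimal sketch of that approach is below, assuming Go 1.14+ (for asynchronous goroutine preemption): exactly GOMAXPROCS goroutines spin on an atomic generation counter instead of blocking between simulated clocks. The barrier type and the trivial worklet are invented for illustration; as noted in the reply above, keeping each goroutine on its own core is observed behaviour, not something the runtime guarantees.

```go
// Busy-wait barrier sketch: one goroutine per available core, no blocking
// between simulated clocks. Illustrative only, not code from the thread.
package main

import (
	"fmt"
	"runtime"
	"sync/atomic"
)

// spinBarrier releases all waiters once `parties` of them have arrived.
type spinBarrier struct {
	parties int64
	arrived int64
	gen     int64
}

func (b *spinBarrier) wait() {
	g := atomic.LoadInt64(&b.gen)
	if atomic.AddInt64(&b.arrived, 1) == b.parties {
		atomic.StoreInt64(&b.arrived, 0)
		atomic.AddInt64(&b.gen, 1) // last arrival opens the barrier
		return
	}
	for atomic.LoadInt64(&b.gen) == g {
		// busy wait; runtime.Gosched() could be added here if over-subscribed
	}
}

func main() {
	n := runtime.GOMAXPROCS(0) // one goroutine per available core
	b := &spinBarrier{parties: int64(n)}
	var sum int64
	done := make(chan struct{})
	for i := 0; i < n; i++ {
		go func() {
			for clock := 0; clock < 3; clock++ {
				atomic.AddInt64(&sum, 1) // stand-in for the per-core worklet
				b.wait()                 // everyone has finished this clock
			}
			done <- struct{}{}
		}()
	}
	for i := 0; i < n; i++ {
		<-done
	}
	fmt.Println(sum) // expect 3*n
}
```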

Re: [go-nuts] Advice, please

2021-01-17 Thread Pete Wilson
That’s exactly the plan. The idea is to simulate perhaps the workload of a complete chiplet. That might be (assuming no SIMD in the processors to keep the example light) 2K cores. Each worklet is perhaps 25-50 nsec (worklet = work done for one core) for each simulated core. The simplest mechan

Re: [go-nuts] Advice, please

2021-01-17 Thread Robert Engels
If there is very little work to be done, then you have N “threads” do M partitioned work. If M is 10x N, you’ve decimated the synchronization cost. > On Jan 17, 2021, at 9:11 AM, Pete Wilson wrote: > The problem is that N or so channel communications twice per simulated clock seem to ta
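As a sketch of that partitioning (names and the per-worklet computation are placeholders, not the poster's code): N long-lived workers each take a contiguous batch of the M worklets, so each simulated clock costs on the order of N synchronizations rather than M.

```go
// M worklets partitioned across N workers: one wake-up per worker per clock.
package main

import (
	"fmt"
	"sync"
)

func main() {
	const m, n = 1000, 10 // M worklets partitioned across N workers
	work := make([]int, m)

	tick := make([]chan struct{}, n)
	var wg sync.WaitGroup
	for w := 0; w < n; w++ {
		tick[w] = make(chan struct{})
		go func(w int) {
			lo, hi := w*m/n, (w+1)*m/n
			for range tick[w] { // one wake-up per clock, not per worklet
				for i := lo; i < hi; i++ {
					work[i]++ // stand-in for the real per-worklet computation
				}
				wg.Done()
			}
		}(w)
	}

	for clock := 0; clock < 5; clock++ {
		wg.Add(n)
		for w := 0; w < n; w++ {
			tick[w] <- struct{}{}
		}
		wg.Wait() // roughly 2*N synchronizations per clock instead of 2*M
	}
	fmt.Println(work[0], work[m-1]) // expect 5 5
}
```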

Re: [go-nuts] Advice, please

2021-01-17 Thread Pete Wilson
The problem is that N or so channel communications twice per simulated clock seem to take much longer than the time spent doing useful work. Go isn’t designed for this sort of work, so it’s not a complaint to note it’s not as good as I’d like it to be. But the other problem is that processors

Re: [go-nuts] Advice, please

2021-01-16 Thread Bakul Shah
You may be better off maintaining two state *arrays*: one that has the current values as input and one for writing the computed outputs. At the "negative" clock edge you swap the arrays. Since the input array is never modified during the half clock cycle when you compute, you can divide it up in N co
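A minimal sketch of that two-array scheme, with invented names and a stand-in update rule: workers read only from the current array and write only into the next one, so the chunks can be computed in parallel without locks, and the coordinator swaps the slices at the simulated negative edge.

```go
// Double-buffered state: compute next from read-only cur in parallel chunks,
// then swap at the "negative edge". Illustrative only.
package main

import (
	"fmt"
	"sync"
)

func main() {
	const size, workers, cycles = 1024, 4, 10
	cur := make([]int, size)
	next := make([]int, size)
	for i := range cur {
		cur[i] = i
	}

	chunk := size / workers
	for c := 0; c < cycles; c++ {
		var wg sync.WaitGroup
		for w := 0; w < workers; w++ {
			wg.Add(1)
			go func(lo, hi int) {
				defer wg.Done()
				for i := lo; i < hi; i++ {
					// compute this element's next state from the read-only cur array
					next[i] = cur[i] + cur[(i+1)%size]
				}
			}(w*chunk, (w+1)*chunk)
		}
		wg.Wait()
		cur, next = next, cur // "negative edge": swap the arrays
	}
	fmt.Println(cur[0], cur[size-1])
}
```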

Re: [go-nuts] Advice, please

2021-01-14 Thread Pete Wilson
N in this case will be similar to the value of GOMAXPROCS, on the assumption that scaling that far pays off. I would love to have the problem of having a 128-core computer… (Though then if scaling tops out at 32 cores one simply runs 4 experiments.) — P > On Jan 13, 2021, at 10:31 PM, Rober

Re: [go-nuts] Advice, please

2021-01-14 Thread Pete Wilson
I have decided to believe that scheduling overhead is minimal, and only customize if this is untrue for my workload. [I don’t like customizing. Stuff in the standard library has been built by folk who have done this in anger, and the results have been widely used; plus principle of least surpris

Re: [go-nuts] Advice, please

2021-01-14 Thread Pete Wilson
Only because I had started out my explanation in a prior thought trying to use ’naked’ read and write atomic fences, and having exactly 3 (main, t0, t1) exposed all that needed to be exposed of the problem. Yes, if this scales, there will be more goroutines than 3. > On Jan 13, 2021, at 9:58 PM,

Re: [go-nuts] Advice, please

2021-01-14 Thread Pete Wilson
Dave, thanks for the thoughts. My intent is to use non-custom barriers and measure scalability, customising if need be. Not busy-waiting the goroutines leaves processor time for other stuff (like visualization, looking for patterns, etc.). There’s also the load-balancing effect; code pretending to b

Re: [go-nuts] Advice, please

2021-01-14 Thread David Riley
On Jan 13, 2021, at 7:21 PM, Peter Wilson wrote: > So, after a long ramble, given that I am happy to waste CPU time in busy waits (rather than have the overhead of scheduling blocked goroutines), what is the recommendation for the signalling mechanism when all is done in go and everything

Re: [go-nuts] Advice, please

2021-01-13 Thread Robert Engels
This https://github.com/robaho/go-concurrency-test might help. It has some relative performance metrics of some sync primitives. In general, though, Go uses “fibers”, which don’t match up well with busy loops - and the scheduling overhead can be minimal. > On Jan 13, 2021, at 6:57 PM, Peter Wilso
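For a rough feel of those numbers (this sketch is not taken from the linked repository), one can time a single channel round trip against a busy-wait handoff on an atomic flag; results vary widely with the machine and with GOMAXPROCS.

```go
// Crude comparison of per-synchronization cost: channel ping-pong vs. a
// busy-wait handoff on an atomic flag. Illustrative only.
package main

import (
	"fmt"
	"runtime"
	"sync/atomic"
	"time"
)

const iters = 1_000_000

func channelPingPong() time.Duration {
	ping, pong := make(chan struct{}), make(chan struct{})
	go func() {
		for range ping {
			pong <- struct{}{}
		}
	}()
	start := time.Now()
	for i := 0; i < iters; i++ {
		ping <- struct{}{}
		<-pong
	}
	close(ping)
	return time.Since(start)
}

func atomicPingPong() time.Duration {
	var flag int32
	done := make(chan struct{})
	go func() {
		for n := 0; n < iters; n++ {
			for atomic.LoadInt32(&flag) == 0 {
				runtime.Gosched() // yield so the spinning sides cannot starve each other
			}
			atomic.StoreInt32(&flag, 0)
		}
		close(done)
	}()
	start := time.Now()
	for i := 0; i < iters; i++ {
		atomic.StoreInt32(&flag, 1)
		for atomic.LoadInt32(&flag) == 1 {
			runtime.Gosched()
		}
	}
	<-done
	return time.Since(start)
}

func main() {
	fmt.Println("channel round trip:", channelPingPong()/iters)
	fmt.Println("atomic busy-wait  :", atomicPingPong()/iters)
}
```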

[go-nuts] Advice, please

2021-01-13 Thread Peter Wilson
Folks, I have code in C which implements a form of discrete event simulator, but optimised for a clocked system in which most objects accept input on the positive edge of a simulated clock, do their appropriate internal computations, and store the result internally. On the negative edge of the c
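For reference, a stripped-down Go rendering of that structure might look like the sketch below: one goroutine per simulated object, with a channel synchronization at the positive edge and another at the negative edge. The object type and its update rule are invented for illustration; this is the per-clock channel traffic that the rest of the thread discusses.

```go
// Two-phase clocked simulation sketch: compute and store internally on the
// positive edge, publish the stored value on the negative edge.
package main

import "fmt"

type object struct {
	state   int // value visible to other objects during this cycle
	pending int // value computed on the positive edge, latched on the negative edge
}

func main() {
	const nObjects, nCycles = 4, 3
	objs := make([]*object, nObjects)
	for i := range objs {
		objs[i] = &object{state: i}
	}

	posEdge := make([]chan struct{}, nObjects)
	negEdge := make([]chan struct{}, nObjects)
	done := make(chan struct{}, nObjects)
	for i := range objs {
		posEdge[i] = make(chan struct{})
		negEdge[i] = make(chan struct{})
		go func(i int) {
			o := objs[i]
			for range posEdge[i] {
				// positive edge: read neighbours' current state, compute, store internally
				left := objs[(i+nObjects-1)%nObjects]
				o.pending = o.state + left.state
				done <- struct{}{}
				// negative edge: make the computed value visible
				<-negEdge[i]
				o.state = o.pending
				done <- struct{}{}
			}
		}(i)
	}

	for cycle := 0; cycle < nCycles; cycle++ {
		for i := range objs { // positive edge: roughly N channel sends
			posEdge[i] <- struct{}{}
		}
		for range objs {
			<-done
		}
		for i := range objs { // negative edge: roughly N more
			negEdge[i] <- struct{}{}
		}
		for range objs {
			<-done
		}
	}
	for _, o := range objs {
		fmt.Println(o.state)
	}
}
```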