On 1/17/21 4:46 PM, Bakul Shah wrote:
> I’d be tempted to just use C for this. That is, generate C code from a
> register level description of your N simulation cores and run that. That is
> more or less what “cycle based” verilog simulators (used to) do. The code gen
> can also split it M ways to run on M physical cores. You can also gener
Yes, that is possible.
The simulated cores are already generated functions in C.
It’s my experience that if you can leverage an existing concurrency framework
then life is better for everyone; go’s is fairly robust, so this is an
experiment to see how close I can get. A real simulation system has
I’d observed that, experimentally, but I don’t think it’s guaranteed behaviour
:-(
> On Jan 17, 2021, at 10:07 AM, Robert Engels wrote:
This particular problem is akin to the way GPUs work; in GPU computing, you
generally marshal all your inputs into one buffer and have the outputs from a
single kernel written to an output buffer (obviously there can be more on
either side). The execution overhead of a single invocation is larg
If you use GOMAXPROCS as a subset of physical cores and have the same number of
routines you can busy wait in a similar fashion.
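As an illustrative sketch only (none of this is code from the thread, and the names are invented): with GOMAXPROCS equal to the number of worker goroutines, a busy-wait barrier of the kind being discussed can be built on sync/atomic, roughly like this.

package sim

import "sync/atomic"

// SpinBarrier is a sense-reversing busy-wait barrier for n goroutines.
// It assumes one goroutine per P (GOMAXPROCS == n), so spinning burns a
// dedicated core rather than starving other runnable goroutines.
type SpinBarrier struct {
    n       int32
    arrived int32
    sense   int32 // flips between 0 and 1 each round
}

func NewSpinBarrier(n int) *SpinBarrier {
    return &SpinBarrier{n: int32(n)}
}

// Wait spins until all n participants have called Wait for this round.
func (b *SpinBarrier) Wait() {
    old := atomic.LoadInt32(&b.sense)
    if atomic.AddInt32(&b.arrived, 1) == b.n {
        atomic.StoreInt32(&b.arrived, 0) // reset for the next round
        atomic.StoreInt32(&b.sense, 1-old)
        return
    }
    for atomic.LoadInt32(&b.sense) == old {
        // busy wait; since Go 1.14 the runtime can preempt even this
        // tight loop, so it cannot wedge the scheduler
    }
}

Each worker would run its chunk of worklets and call Wait once per clock phase. Whether this actually beats stock channels or sync.WaitGroup is something to measure rather than assume.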
> On Jan 17, 2021, at 9:50 AM, Pete Wilson wrote:
That’s exactly the plan.
The idea is to simulate perhaps the workload of a complete chiplet. That might
be (assuming no SIMD in the processors to keep the example light) 2K cores.
Each worklet is perhaps 25-50 nsec (worklet = work done for one core) for each
simulated core
The simplest mechan
If there is very little work to be done - then you have N “threads” do M
partitioned work. If M is 10x N you’ve decimated the synchronization cost.
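An illustrative sketch of that partitioning (invented names, not code from the thread): N long-lived worker goroutines each own a contiguous chunk of the M worklets, so each simulated clock costs roughly two synchronizations per worker instead of two per worklet.

package sim

import "sync"

// runClocks drives nWorkers goroutines over the worklets for the given
// number of simulated clocks. Synchronization cost is roughly 2*nWorkers
// operations per clock, no matter how many worklets there are.
func runClocks(worklets []func(), nWorkers, clocks int) {
    var wg sync.WaitGroup
    start := make([]chan struct{}, nWorkers)
    chunk := (len(worklets) + nWorkers - 1) / nWorkers

    for w := 0; w < nWorkers; w++ {
        start[w] = make(chan struct{})
        lo, hi := w*chunk, (w+1)*chunk
        if lo > len(worklets) {
            lo = len(worklets)
        }
        if hi > len(worklets) {
            hi = len(worklets)
        }
        go func(startCh chan struct{}, part []func()) {
            for range startCh { // one wake-up per simulated clock
                for _, f := range part {
                    f() // each worklet is a small amount of real work
                }
                wg.Done()
            }
        }(start[w], worklets[lo:hi])
    }

    for c := 0; c < clocks; c++ {
        wg.Add(nWorkers)
        for _, ch := range start {
            ch <- struct{}{}
        }
        wg.Wait() // barrier: every partition has finished this clock
    }
    for _, ch := range start {
        close(ch)
    }
}

With 2K worklets and nWorkers around GOMAXPROCS, each goroutine carries tens of worklets per clock, which is the amortization being described.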
> On Jan 17, 2021, at 9:11 AM, Pete Wilson wrote:
The problem is that N or so channel communications twice per simulated clock
seems to take much longer than the time spent doing useful work.
go isn’t designed for this sort of work, so it’s not a complaint to note it’s
not as good as I’d like it to be. But the other problem is that processors
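That handshake cost is straightforward to measure. A minimal benchmark along these lines (an illustrative sketch, not from the thread) times one request/reply round trip over unbuffered channels, i.e. the two communications per simulated clock described above:

package sim

import "testing"

// BenchmarkClockHandshake times two unbuffered channel operations per
// iteration: a "start of clock" send and an "end of clock" reply, with
// no useful work in between.
func BenchmarkClockHandshake(b *testing.B) {
    tick := make(chan struct{})
    done := make(chan struct{})
    go func() {
        for range tick {
            done <- struct{}{}
        }
    }()
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        tick <- struct{}{}
        <-done
    }
    close(tick)
}

Running go test -bench ClockHandshake gives a ns/op figure to set against the 25-50 nsec per worklet quoted elsewhere in the thread.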
You may be better off maintaining two state *arrays*: one that has the current
values as input and one for writing the computed outputs. At the "negative" clock
edge you swap the arrays. Since the input array is never modified during the
half clock cycle when you compute, you can divide it up in N co
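A bare-bones version of that two-array scheme might look as follows (illustrative only; State, step and the uint64 state are placeholders for the real generated core functions): every goroutine reads only cur and writes only its own slice of next, and one swap at the edge publishes the results.

package sim

import "sync"

// State is double-buffered per-core state: cur is read-only and next is
// write-only during a half clock cycle; they are swapped at the edge.
type State struct {
    cur, next []uint64
}

// step computes core i's next value purely from the current array, so
// disjoint ranges of cores can be evaluated concurrently without locks.
func step(cur []uint64, i int) uint64 {
    return cur[i] + 1 // stand-in for the generated core function
}

// RunClock evaluates one simulated clock with nWorkers goroutines, then
// performs the "negative edge" swap.
func (s *State) RunClock(nWorkers int) {
    var wg sync.WaitGroup
    n := len(s.cur)
    chunk := (n + nWorkers - 1) / nWorkers
    for lo := 0; lo < n; lo += chunk {
        hi := lo + chunk
        if hi > n {
            hi = n
        }
        wg.Add(1)
        go func(lo, hi int) {
            defer wg.Done()
            for i := lo; i < hi; i++ {
                s.next[i] = step(s.cur, i)
            }
        }(lo, hi)
    }
    wg.Wait()
    s.cur, s.next = s.next, s.cur // swap at the clock edge
}

In a real run the goroutines would be long-lived, as in the earlier partitioning sketch, rather than spawned per clock; they are spawned here only to keep the buffer swap in view.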
N in this case will be similar to the value of GOMAXPROCS, on the assumption
that scaling that far pays off.
I would love to have the problem of having a 128 core computer…. (Though then
if scaling tops out at 32 cores one simply runs 4 experiments..)
— P
> On Jan 13, 2021, at 10:31 PM, Robert Engels wrote:
I have decided to believe that scheduling overhead is minimal, and only
customize if this is untrue for my workload.
[I don’t like customizing. Stuff in the standard library has been built by folk
who have done this in anger, and the results have been widely used; plus
principle of least surpris
Only because I had started out my explanation in a prior thought trying to use
’naked’ read and write atomic fences, and having exactly 3 (main, t0, t1)
exposed all that needed to be exposed of the problem.
Yes, if this scales, there will be more goroutines than 3.
> On Jan 13, 2021, at 9:58 PM,
Dave
Thanks for thoughts.
My intent is to use non-custom barriers and measure scalability, customising if
need be. Not busy-waiting the go routines leaves processor time for other stuff
(like visualization, looking for patterns, etc)
There’s also the load-balancing effect; code pretending to b
On Jan 13, 2021, at 7:21 PM, Peter Wilson wrote:
> So, after a long ramble, given that I am happy to waste CPU time in busy
> waits (rather than have the overhead of scheduling blocked goroutines), what
> is the recommendation for the signalling mechanism when all is done in go and
> everything
This https://github.com/robaho/go-concurrency-test might help. It has some
relative performance metrics of some sync primitives. In general though Go uses
“fibers” which don’t match up well with busy loops - and the scheduling
overhead can be minimal.
> On Jan 13, 2021, at 6:57 PM, Peter Wilson wrote:
Folks
I have code in C which implements a form of discrete event simulator, but
optimised for a clocked system in which most objects accept input on the
positive edge of a simulated clock, do their appropriate internal
computations, and store the result internally. On the negative edge of the
c
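The preview is cut off, but the clocked evaluate/commit shape being described might be sketched roughly as below (a guess at the structure, with invented names; the negative-edge behaviour in particular is an assumption, since the original sentence is truncated):

package sim

// Object is one clocked simulation object: PosEdge samples its inputs and
// computes/stores the next internal state; NegEdge is assumed to be where
// the stored result becomes visible to other objects.
type Object interface {
    PosEdge()
    NegEdge()
}

// RunClocks drives every object through n simulated clocks, sequentially.
// The rest of the thread is about distributing these two inner loops over
// goroutines without the per-edge synchronization dwarfing the work.
func RunClocks(objs []Object, n int) {
    for c := 0; c < n; c++ {
        for _, o := range objs {
            o.PosEdge()
        }
        for _, o := range objs {
            o.NegEdge()
        }
    }
}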