mike, hi, thank you for replying.

[my response is going direct to the list, as a reply to my original post,
because i am not subscribed on-list.]

unfortunately, aspex's proprietary tool-chain - written in modula-2 -
is extremely unlikely to ever be integrated into gcc.

secondly, the code it generates is c-code - not any kind of assembler.

that c-code places instructions onto a memory-mapped FIFO queue.
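
to give a flavour of what "placing instructions onto a memory-mapped FIFO"
means in practice, here is a minimal c-style sketch - every name and detail
here is made up for illustration; aspex's actual generated code will differ:

    #include <cstdint>

    /* hypothetical pointer to the card's instruction FIFO, i.e. a PCI BAR
     * region mmap()ed into the host process during initialisation         */
    static volatile uint32_t *asp_fifo;

    /* pushing an instruction word is, from the host's point of view, just
     * an ordinary store to a memory-mapped address                         */
    static inline void asp_push(uint32_t insn_word)
    {
        *asp_fifo = insn_word;
    }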

there is no modified processor for which different assembly
instructions are required or could even be generated, and as
a result the PCI card containing the ASP processors can go
into _any_ standard hardware with _any_ standard processor.

[unlike x86 MMX instructions and unlike MasPar hardware and
unlike Sony PlayStations]

further integration into gcc is, i believe, in this instance
unnecessary, undesirable and, as you _and_ i are both aware, a
costly and complex task.

additionally, full and transparent integration (i.e. automatic
recognition of arrays of ints and turning them into hardware-accelerated
ASP code) is a YET MORE complex task.

regenerating template-instantiated code-fragments, "outsourcing" them
to aspex's toolchain and re-running context-sensitive gcc parsing on them
is the most "sane" from-here-to-there way that i can think of doing
things.

it also represents an alternate "way out" that doesn't force
you to go the whole hog of [MasPar?  i presume by OpenMP you
mean maspar] MP's "plural" syntax.


the rest of this message comprises a justification of the above
conclusions, followed by some yapping about maspar's modified version
of gcc.



Aspex's toolchain is a mish-mash of c-code and what they call "aop"
statements - which identify to the pre-processor that the code should
be replaced [ultimately with memory-mapped FIFO instructions written
in c-code].
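
purely to illustrate the flavour of that replacement - this is NOT aspex's
actual "aop" syntax or instruction encoding, which i'm not going to try to
reproduce - an "aop" region gets expanded by their pre-processor into plain
c-code that pushes instruction words down the memory-mapped FIFO, along the
lines of the asp_push() sketch earlier:

    #include <cstdint>

    extern void asp_push(uint32_t insn_word); /* FIFO-write sketch from above */

    /* before pre-processing, the source might contain something like
     *     aop { c = a + b; }        <- illustration only, not real syntax
     * and after pre-processing it becomes ordinary c-code roughly like:     */
    void aop_region_expanded(void)
    {
        asp_push(0x01000000u);  /* hypothetical "ADD" instruction word        */
        asp_push(0x00000012u);  /* hypothetical operand / routing descriptor  */
        /* ... further FIFO words describing the rest of the operation ...    */
    }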

the c-code libraries that you need to call exist to set up and communicate
with the ASP's DMA engine.

typically, work comprises:

* calling up a DMA "write" operation which goes ahead whilst
  also performing calculations [from a former operation]

* then when the data is "there", _and_ when the former calculations
  work is completed, transferring the DMA'd data to registers
  (one clock cycle, wheee :).

* then also initiating a DMA "out" operation to get the former completed
  work _out_ whilst waiting for the "new" work to complete..

... you get the idea.
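
in c-code terms the overlap pattern looks roughly like the following
double-buffered loop - all of the asp_* names, and their exact semantics,
are invented stand-ins for the real aspex library calls:

    #include <cstddef>
    #include <cstdint>

    /* hypothetical stand-ins for the real aspex DMA / control library */
    extern void asp_dma_write_start(const void *src, std::size_t bytes);
    extern void asp_dma_read_start (void *dst,       std::size_t bytes);
    extern void asp_dma_wait(void);        /* block until the DMA completes */
    extern void asp_load_registers(void);  /* DMA'd data -> APE registers   */
    extern void asp_run(void);             /* start the next calculation    */
    extern void asp_compute_wait(void);    /* block until calculation done  */

    void process(const uint32_t *in, uint32_t *out,
                 std::size_t blocks, std::size_t block_bytes)
    {
        const std::size_t words = block_bytes / sizeof(uint32_t);

        for (std::size_t i = 0; i < blocks; i++) {
            /* 1. start DMA-ing block i in while block i-1 still computes */
            asp_dma_write_start(in + i * words, block_bytes);

            /* 2. once the data is there _and_ the previous calculation is
             *    finished, move the data into registers (one clock cycle) */
            asp_dma_wait();
            if (i > 0)
                asp_compute_wait();
            asp_load_registers();

            /* 3. ship block i-1's results out while block i computes      */
            if (i > 0)
                asp_dma_read_start(out + (i - 1) * words, block_bytes);
            asp_run();
        }

        /* drain the last block's results */
        asp_compute_wait();
        asp_dma_read_start(out + (blocks - 1) * words, block_bytes);
        asp_dma_wait();
    }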

the DMA operations are done - once again - by writing instructions down
the PCI bus, but these are handled via a c-code library.

you also need to insert "delay" instructions into the c-code, ranging
from nanoseconds to microseconds, depending on how the ASP string is
"segmented" / subdivided, due to ripple / intercommunication along the
sequentially-connected ALUs.  if all the gates are open, and a number of
ASP processors are linked together off-chip, you're looking at
microseconds in communication time; if they're all independent and you
only have ... say... 4096 groups of 16 APEs, you're looking at
nanosecond (i.e. probably unnecessary) delay times.
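
a crude way of picturing that delay decision (the figures and the helper
function are made up; the real values depend on the exact string
configuration):

    #include <cstdint>

    extern void asp_delay_ns(uint32_t ns);  /* hypothetical busy-wait helper */

    /* pick a settle delay after an inter-APE communication step, based on
     * how the string is segmented - illustrative figures only              */
    void asp_settle(bool off_chip_links, unsigned groups, unsigned apes_per_group)
    {
        if (off_chip_links) {
            asp_delay_ns(2000); /* several chips rippling through each other:
                                 * microseconds                              */
        } else if (groups >= 4096 && apes_per_group <= 16) {
            asp_delay_ns(10);   /* short, independent segments: nanoseconds,
                                 * probably unnecessary                      */
        } else {
            asp_delay_ns(200);  /* somewhere in between                      */
        }
    }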


so, when i say that full and transparent integration [a la "plural int",
a la MasPar] would be complex, i really MEAN it would be complex [it's
enough of a job just to consider doing a hardware-accelerated version
of the c++ valarray STL].

not least because the ALUs on the ASP are 2-bit processors: a string
of 16 of them can therefore either be utilised to perform two 16-bit
"Add" instructions in parallel/pipelined fashion, or be utilised to
perform sixteen N-bit "Add" instructions in N/2 clock cycles.

and anything in between!!!!!

and the choice about which to use is usually made by the PROGRAMMER,
depending on how much data is being transferred in and out: a really
high-speed parallelised "Add" could result in answers being available
so fast that you cannot get them in and out with the DMA engine quickly
enough!
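
to put some very rough numbers on the trade-off (my own back-of-the-envelope
arithmetic, not aspex's figures):

    /* a string of 16 two-bit ALUs is 32 bits of ALU width in total, so it
     * can (roughly) be configured as:
     *   mode A: two 16-bit "Add"s per step, side by side / pipelined; or
     *   mode B: sixteen independent N-bit "Add"s, bit-serial at 2 bits per
     *           clock, i.e. N/2 cycles each (N = 16 -> 8 cycles for all 16);
     * or anything in between.                                              */
    unsigned cycles_bit_serial(unsigned n_bits) { return n_bits / 2; }

    /* a mode is only worth using if the DMA engine can drain the answers:
     * if the array produces result bytes faster than the PCI DMA can move
     * them, the extra parallelism is wasted and a "slower" mode wins.      */
    bool dma_bound(double result_bytes_per_sec, double dma_bytes_per_sec)
    {
        return result_bytes_per_sec > dma_bytes_per_sec;
    }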

... and the amount of work required to ascertain which approach is
best is MIND-numbingly large.  even _estimating_ which is the best
approach could take days.

and, with c++ valarray STL assistance, the development process could
potentially take MINUTES instead of WEEKS.
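
the attraction being that, to the programmer, the hardware-accelerated code
would then look like perfectly ordinary valarray code - something along
these lines (the ASP-backed implementation behind it is of course the
hypothetical part):

    #include <valarray>

    /* plain std::valarray code; the idea is that an ASP-backed
     * implementation of the same interface could run this on the card
     * without the programmer hand-scheduling DMA, delays or bit-widths */
    std::valarray<int> scale_and_add(const std::valarray<int> &a,
                                     const std::valarray<int> &b)
    {
        return a * 3 + b;  /* element-wise multiply-add across the array */
    }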


massively parallel bit-level ALUs represent the EXTREME in
flexible and hard-core raw processing power.

you are NEVER - not if you are in your right mind - going to
get THAT integrated into gcc.  EVER.

:)


... all that having been said, i would _love_ to see the MP
"plural" syntax integrated back into gcc.

i _did_ spend a significant amount of effort trying to track
down the original MP-modified tarball - several months, in fact.

the original developer remained elusive (and, in all
probability, quite upset at MasPar having been shut down),
and the hard drive on which the development of gcc had been
done was known by one of his colleagues to have been sitting
on a shelf for several years.

the location where the maspar-modified gcc was _supposed_ to be
downloadable from is subscriber-only, and they (the company that
bought out and then shut down maspar) are playing hard-ball.

basically - due to the free software community not paying
attention - the probability of obtaining that code is, unless
we're _extremely_ lucky and someone happens to have it somewhere
and hasn't yet come forward, extremely low.

i'd be absolutely _delighted_ if someone has managed to track that
modified gcc compiler down.

l.
