mike, hi, thank you for replying. [i am replying on-list to my own original post because i am not subscribed to the list.]
unfortunately, aspex's proprietary tool-chain - written in modula-2 - is extremely unlikely ever to be integrated into gcc.

secondly, the code it generates is c-code - not any kind of assembler. that c-code places instructions onto a memory-mapped FIFO queue. there is no modified processor for which different assembly instructions are required, or even could be generated, and as a result the PCI card containing the ASP processors can go into _any_ standard hardware with _any_ standard processor. [unlike x86 MMX instructions and unlike MasPar hardware and unlike Sony PlayStations]

further integration into gcc is, i believe in this instance, unnecessary, undesirable and, as you _and_ i are both aware, a costly and complex task. additionally, full and transparent integration (i.e. automatic recognition of arrays of ints and turning them into hardware-accelerated ASP code) is a YET MORE complex task.

regenerating template-instantiated code-fragments, "outsourcing" them to aspex's toolchain and re-running context-sensitive gcc parsing on them is the most "sane" from-here-to-there way that i can think of doing things. it also represents an alternate "way out" that doesn't force you to go the whole hog of [MasPar? i presume by OpenMP you mean maspar] MP's "plural" syntax.

the rest of this message comprises a justification of the above conclusions, followed by some yapping about maspar's modified version of gcc.

Aspex's toolchain is a mish-mash of c-code and what they call "aop" statements - which tell the pre-processor that the code should be replaced [ultimately with memory-mapped FIFO instructions written in c-code]. the c-code libraries that you need to call are there to set up and communicate with the ASP's DMA engine.
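to make the "memory-mapped FIFO" point concrete, here is a minimal sketch of what the generated c-code boils down to. everything named here (asp_dev, asp_push, asp_push_block, the 32-bit word size) is hypothetical and made up for illustration: the real aspex register layout and instruction encoding are not described in this message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* HYPOTHETICAL: stands in for the card's memory-mapped FIFO register.
 * on real hardware this pointer would refer to a mapped PCI region. */
typedef struct {
    volatile uint32_t *fifo;
} asp_dev;

/* each store to the FIFO address enqueues one instruction word.
 * "volatile" stops the compiler merging or dropping the stores. */
static void asp_push(asp_dev *dev, uint32_t insn)
{
    assert(dev != NULL && dev->fifo != NULL);
    *dev->fifo = insn;
}

/* push a whole pre-generated instruction sequence, in order. */
static void asp_push_block(asp_dev *dev, const uint32_t *insns, size_t n)
{
    assert(insns != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        asp_push(dev, insns[i]);
}
```

the point of the sketch is that nothing here needs compiler support: it is ordinary c, with plain stores to an address, which is why the card can sit behind any host processor.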
typically, work comprises:

* calling up a DMA "write" operation which goes ahead whilst also performing calculations [from a former operation]
* then when the data is "there", _and_ when the former calculations work is completed, transferring the DMA'd data to registers (one clock cycle, wheee :).
* then also initiating a DMA "out" operation to get the former completed work _out_ whilst waiting for the "new" work to complete..

... you get the idea.

the DMA operations are done - once again - by writing instructions down the PCI bus, but these are handled via a c-code library.

you also need to insert "delay" instructions into the c-code, ranging from nanoseconds to microseconds, depending on how the ASP string is "segmented" / subdivided, due to ripple / intercommunication along the sequentially-connected ALUs. if all the gates are open, and a number of ASP processors are linked together off-chip, you're looking at microseconds in communication time; if they're all independent and you only have ... say... 4096 groups of 16 APEs, you're looking at nanosecond (i.e. probably unnecessary) delay times.

so, when i say that full and transparent integration [a la "plural int" a la MasPar] would be complex i really MEAN it would be complex [it's enough of a job to consider doing a hardware-accelerated version of the c++ valarray STL].

not least because the ALUs on the ASP are 2-bit processors, and a string of 16 processors can therefore either be utilised to perform two single 16-bit "Add" instructions, in parallel/pipelined, or they can be utilised to perform 16 separate N-bit "Add" instructions in N/2 clock cycles. and anything in between!!!!!

and the choice about which to use is usually made by the PROGRAMMER, depending on how much data is being transferred in-and-out: a really high-speed parallelised "Add" could result in answers being available so fast that you cannot get them in and out with the DMA engine quick enough!
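the arithmetic behind those two extremes can be written down as a small sketch. the function names are hypothetical; the only facts taken from the above are the 2-bit ALU width and the "N/2 clock cycles" figure. rounding up for an odd N is my own assumption, and this deliberately ignores the ripple delays and DMA limits mentioned above.

```c
#include <assert.h>

/* from the text: each ALU on the ASP is a 2-bit processor */
enum { ASP_ALU_BITS = 2 };

/* bit-serial extreme: every APE adds its own N-bit operand,
 * 2 bits per clock, so N/2 clocks (rounded up: an assumption). */
static unsigned asp_serial_add_cycles(unsigned operand_bits)
{
    assert(operand_bits > 0);
    return (operand_bits + ASP_ALU_BITS - 1) / ASP_ALU_BITS;
}

/* ganged extreme: adjacent APEs are chained into wide adders.
 * returns how many operand_bits-wide adds fit across the string. */
static unsigned asp_ganged_adds(unsigned apes, unsigned operand_bits)
{
    assert(operand_bits > 0);
    return (apes * ASP_ALU_BITS) / operand_bits;
}
```

for a string of 16 APEs and 16-bit operands this gives 2 adds at once in ganged mode, versus 16 adds taking 8 clocks each in bit-serial mode, which is exactly the trade-off the programmer has to weigh against DMA throughput.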
and the amount of work required to ascertain which approach is best is MIND-numbingly large. even _estimating_ which is the best approach could take days. and, with c++ valarray STL assistance, the development process could potentially take MINUTES instead of WEEKS.

massively parallel bit-level ALUs represent the EXTREME in flexible and hard-core raw processing power. you are NEVER - not if you are in your right mind - going to get THAT integrated into gcc. EVER. :)

... all that having been said, i would _love_ to see the MP "plural" syntax integrated back into gcc.

i _did_ spend a significant amount of effort trying to track down the original MP-modified tarball - several months, in fact. the original developer remained elusive (and, in all probability, quite upset at MasPar having been shut down), and the hard drive on which the development of gcc had been done was known by one of his colleagues to have been sitting on a shelf for several years.

the location where the maspar-modified gcc was _supposed_ to be downloadable from is subscriber-only, and they (the company that bought out and then shut down maspar) are playing hard-ball.

basically - due to the free software community not paying attention - the probability of obtaining that code is, unless we're _extremely_ lucky and someone happens to have it somewhere and hasn't yet come forward, extremely low.

i'd be absolutely _delighted_ if someone has managed to track that modified gcc compiler down.

l.