Hi Blake,

The state of GNU APL is, I believe, this:

I am not aware of any unnecessary copying of arrays in GNU APL. It was suspected earlier that some copying could be avoided. But it then turned out that removing these copies would corrupt other values under certain circumstances (because different functions would modify the same value). We therefore had to revert the "unnecessary copies" to a state that seems to be safe now.

Regarding parallel processing, it seems that a multi-core CPU cannot be significantly faster than a single-core CPU with the same memory (interface). A significant speedup requires that the bandwidth between the cores and the memory grows with the number of cores. I built a machine like that with some students back in 1990, but the current PCs with multi-core CPUs just do not provide that.

If you make parallel1 and then run ScalarBenchmark.apl (which comes with GNU APL) then you get, for example:

-------------- Mix_IRC + Mix1_IRC --------------
average sequential startup cost:   530 cycles
average parallel startup cost:    1300 cycles
per item cost sequential:          122 cycles
per item cost parallel:            195 cycles
parallel break-even length:        not reached

This means that the additional start-up cost for a parallel computation is 1300-530=770 cycles, or 240 nanoseconds on a 3.2 GHz machine. This is actually a pretty good value. Before writing my own core synchronization functions I used a standard parallelization library that took about 20,000 cycles. I believe it was libmpi, but that was many years ago, so I don't quite remember. This startup cost also includes what Elias refers to as bookkeeping.

What remains (and is the show-stopper) is the per item cost. A single core needs 122 cycles for adding two numbers, while 4 cores need 195 cycles per core = 780 cycles in total. The code in both cases is exactly the same.
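
The arithmetic behind the benchmark figures quoted above can be checked with a small sketch. It is not part of GNU APL or ScalarBenchmark.apl. The function names are made up for illustration, and the linear cost model (startup cost plus per item cost times vector length) is only an assumption about how a break-even length would be derived from these numbers.

```python
import math

CLOCK_GHZ = 3.2   # clock rate quoted in the message
CORES = 4         # core count quoted in the message

# Figures copied from the benchmark output in the message (in cycles).
SEQ_STARTUP, PAR_STARTUP = 530, 1300
SEQ_PER_ITEM, PAR_PER_ITEM = 122, 195


def cycles_to_ns(cycles, clock_ghz=CLOCK_GHZ):
    """Convert CPU cycles to nanoseconds (1 GHz = 1 cycle per ns)."""
    return cycles / clock_ghz


def break_even_length(seq_start, seq_item, par_start, par_item):
    """Smallest vector length n at which the parallel run is strictly faster.

    Assumes a linear model: cost(n) = startup + per_item * n.
    Returns None if the parallel per item cost is not lower than the
    sequential one, since the parallel line then never drops below.
    """
    if par_item >= seq_item:
        return None
    n = math.floor((par_start - seq_start) / (seq_item - par_item)) + 1
    return max(1, n)


extra_startup = PAR_STARTUP - SEQ_STARTUP       # additional parallel startup
total_parallel = PAR_PER_ITEM * CORES           # cycles summed over all cores

print("extra startup:", extra_startup, "cycles =",
      f"{cycles_to_ns(extra_startup):.1f} ns")
print("sequential per item:", f"{cycles_to_ns(SEQ_PER_ITEM):.1f} ns")
print("parallel per item, all cores:", total_parallel, "cycles")
print("break-even:", break_even_length(SEQ_STARTUP, SEQ_PER_ITEM,
                                       PAR_STARTUP, PAR_PER_ITEM))
```

Under this model the break-even length only exists if the parallel per item cost drops below the sequential one, which is consistent with the "not reached" line in the output.
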

Putting it differently: if I run alone then some function takes 122 cycles, and when my colleagues work in parallel on something that has nothing to do with my work then I need 780 cycles. Once the parallel startup has finished, the cores work independently and without any locks or the like between them. This cannot be explained at the software level but rather suggests that some common resource (memory?!) is slowing each core down when some other core(s) are working at the same time. The 122 cycles (~40 ns) in the single-core case is roughly the time for one DRAM access in page mode. In other words, the main memory bandwidth (at least of my machine) is just enough to feed one core, but far too low for 4 cores.

GNU APL can do little to fix this bottleneck. I believe I have done everything that can be done in software (although new ideas are welcome), but if we hit hardware bottlenecks then that's it. I believe GNU APL would run perfectly on a 4-core CPU with a 4-port main memory, but that will probably remain a dream in my lifetime.

Best Regards,
Jürgen Sauermann

On 07/04/2017 06:15 PM, Blake McBride
wrote: