This is something that I have been thinking about as well, and the thing
that concerns me most is how multiple operations can be optimised. Part
of that thinking is tied to my idea of keeping track of temporary
objects so that they can be modified in place. The core of that was already
implemented by Jürgen. Thanks for that. :-)  (By the way, there are plenty
more functions that could benefit from in-place modification.)

About multithreading. I have a few concerns, and I'll outline them all here
for posterity:

*Task dispatch and job construction*

Suppose you have a large array A (say, 8k rows) and you perform the
operation 1+2×A. If you have 8 cores, you want to perform the following
operations:

    Core 1: {1+2×⍵}¨A[⍳1000;]
    Core 2: {1+2×⍵}¨A[1000+⍳1000;]
    Core 3: {1+2×⍵}¨A[2000+⍳1000;]
    *etc...*

Note that with in-place modifications this can be quite efficient.

However, note that there must be some intelligent algorithm figuring out
that the sequence 1+2× can be merged into a single threadable job.

If this optimisation is not done, a plain non-optimised multi-core
dispatcher would waste a lot of time on dispatch and subsequent
collection of results:

    Core 1: {2×⍵}¨A[⍳1000;]
    Core 2: {2×⍵}¨A[1000+⍳1000;]
    *etc...*
Then collect the results of the multiplication, followed by another dispatch
to execute:

    Core 1: {1+⍵}¨A[⍳1000;]
    Core 2: {1+⍵}¨A[1000+⍳1000;]
    *etc...*

Unless the arrays that you are working on are very large, you are going to
spend more time synchronising the threads than actually doing real work.
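To make the contrast concrete, here is a minimal C++ sketch of the fused
single-dispatch case. The names and the chunking policy are my own
illustration, not GNU APL's actual code: one pass over the data, modifying
the array in place, with exactly one dispatch/join round.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Apply the fused operation 1 + 2*x in place, one pass per element.
// A non-fused engine would instead make two full passes over the data,
// with a dispatch/collect round for each.
inline void fused_job(double* data, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        data[i] = 1.0 + 2.0 * data[i];
}

// Split v into contiguous slices and run the fused job on each slice
// in its own thread, modifying the array in place.
void parallel_fused(std::vector<double>& v, unsigned ncores) {
    std::vector<std::thread> workers;
    std::size_t chunk = (v.size() + ncores - 1) / ncores;
    for (unsigned c = 0; c < ncores; ++c) {
        std::size_t begin = c * chunk;
        if (begin >= v.size()) break;
        std::size_t len = std::min(chunk, v.size() - begin);
        workers.emplace_back(fused_job, v.data() + begin, len);
    }
    for (auto& w : workers) w.join();   // the single collection point
}
```

The unfused version would run this whole dispatch/join cycle twice, once
for ×2 and once for +1, which is where the synchronisation overhead comes
from.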

*Thread model*

Unless all threadable operations always take the same amount of time (i.e.
no branching), a simple thread pool will not be enough. One would need some
kind of job-stealing queue system. There are several available for C++, but
as I'm not a C++ developer I don't have any experience with them. However,
Threading Building Blocks
<http://en.wikipedia.org/wiki/Threading_Building_Blocks> looks promising.
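I haven't used TBB, but the underlying idea can be sketched without it:
instead of fixed static slices, workers pull small chunks from a shared
atomic cursor, so a worker whose elements happen to be cheap simply grabs
more work. This is self-scheduling, a simpler cousin of the per-thread-deque
work stealing that TBB does; everything below is illustrative, not a
proposal for the actual implementation.

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// When per-element cost varies (branching), static slicing leaves some
// cores idle while others are still busy.  A shared atomic cursor lets
// each worker claim the next chunk of `grain` indices as soon as it
// finishes its previous one.
template <typename F>
void dynamic_for(std::size_t n, unsigned nthreads, std::size_t grain, F f) {
    std::atomic<std::size_t> next{0};
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < nthreads; ++t)
        workers.emplace_back([&] {
            for (;;) {
                // fetch_add returns the old value: this thread now owns
                // indices [begin, begin + grain).
                std::size_t begin = next.fetch_add(grain);
                if (begin >= n) return;
                std::size_t end = std::min(begin + grain, n);
                for (std::size_t i = begin; i < end; ++i) f(i);
            }
        });
    for (auto& w : workers) w.join();
}
```

The only shared state is one atomic counter, so the scheduling overhead is
a single fetch_add per chunk; choosing the grain size is the usual
trade-off between balance and overhead.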

*Synchronisation*

Having the EACH operator be threadable sounds incredibly tempting, but how
should the system deal with synchronisation? It's clear that most
operations do not have to worry about synchronisation, since each works
only on its own small slice of an array, but what about something like this:

    N←0 ◊ {⍵+N←N+1}¨⍳100

The easy way around this would be to simply not multi-thread functions that
contain the assignment operator. The benefit of this is that one can do
away with locking almost completely.
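A hypothetical version of that check in C++ (a real interpreter would of
course walk the parsed token stream rather than the source text, but the
policy is the same: only side-effect-free functions get parallelised):

```cpp
#include <string>

// Refuse to parallelise any lambda whose body contains the assignment
// arrow.  This is deliberately conservative: a function that assigns
// only to its own local variables would also be rejected, but it needs
// no locking at all for everything it accepts.
bool safe_to_parallelise(const std::string& lambda_body) {
    // "←" is a multi-byte UTF-8 sequence; searching the raw bytes
    // finds it anywhere in the body.
    return lambda_body.find("←") == std::string::npos;
}
```

So {1+2×⍵} would be eligible for the parallel path, while the
{⍵+N←N+1} example above would fall back to sequential execution.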

*Multithreaded memory management*

Right now, instances of Value are allocated using the default allocator.
This allocator calls down to malloc(), which is not the fastest in the
world. Most implementations of malloc() also have a single lock, making the
allocation operation effectively single-threaded. Since GNU APL does a lot
of memory allocations, this will definitely be a bottleneck. I would
suggest using an allocator that has a thread-local memory pool, ensuring
that two parallel memory allocations will not interfere with each other.
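A rough sketch of what I mean, with a made-up cell size and no attempt at
returning memory across threads (a production allocator would need both):

```cpp
#include <cstdlib>
#include <vector>

// Thread-local pool of fixed-size cells for Value-like objects.  Each
// thread keeps its own free list, so parallel allocations never touch a
// shared lock; malloc() is only reached when the local pool is empty.
struct ValuePool {
    static constexpr std::size_t CELL = 64;   // assumed cell size
    std::vector<void*> free_list;

    void* allocate() {
        if (!free_list.empty()) {             // fast path: no locking
            void* p = free_list.back();
            free_list.pop_back();
            return p;
        }
        return std::malloc(CELL);             // slow path: may take malloc's lock
    }
    // Freed cells go back to this thread's own list.
    void deallocate(void* p) { free_list.push_back(p); }
};

// One pool per thread, so two threads allocating in parallel never
// contend with each other.
inline ValuePool& local_pool() {
    thread_local ValuePool pool;
    return pool;
}
```

The interesting property is that the common allocate/deallocate cycle
never executes a single atomic operation, which is exactly what a
hot allocation path in the interpreter would want.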

I'm sure there are more issues, but these are the ones I have been thinking
about so far. Comments welcome.

Regards,
Elias


On 10 March 2014 19:44, Juergen Sauermann <juergen.sauerm...@t-online.de> wrote:

> Hi David,
>
> I think your ideas are correct. I have planned multicore support for GNU
> APL 1.4 and
> every help is welcome.
>
> Actually parallel processing was one of the main reasons for me to write
> GNU APL.
> I published a Ph.D thesis "A parallel APL computer" (in German) back in
> 1990. We had
> built a prototype based on multiple MC68020 processors and demonstrated
> that most
> APL primitive functions and operators can benefit from parallel execution.
> The only
> thing missing at that time was the interpreter (which is now GNU APL).
>
> My thoughts so far were going in the direction of OpenMP which uses a
> similar approach
> as the one you describe. But I haven't worked with it yet and there are
> other solutions around.
>
> The most important step is to find the best approach for GNU APL. That
> probably means a lot
> of benchmarking and experimenting.
>
> In GNU APL every primitive function is a separate object so we can decide
> on a per-function
> basis at what length we go multi-core.
>
> /// Jürgen
>
>
>
>
>
> On 03/10/2014 06:33 AM, David B. Lamkins wrote:
>
>> This weekend I spent a few hours poking around in the GNU APL source
>> code while thinking about what it might take to exploit a multicore CPU.
>> I've collected my thoughts on what it might take to do an initial
>> implementation (see attached).
>>
>> Juergen (and anyone else who has worked in the code base), I'd
>> appreciate some feedback regarding the feasibility of my proposal, as
>> well as your comments about things I've overlooked.
>>
>> This is something I'd be willing to work on in my spare time.
>>
>>
>>
>
>