On Mon, 17 Feb 2020 at 19:57, Dr. Jürgen Sauermann <mail@jürgen-sauermann.de>
wrote:

> The discussions around parallelisation led me to start thinking about ways
> to improve this in APL, and I was wondering if it would be possible to make
> a purely immutable APL implementation. If so, it would, at least in theory,
> be possible to defer a lot of computations until a result is actually
> requested. This would be a lot easier to parallelise.
>
> Not quite sure what purely immutable implementation means. Is it lazy
> evaluation where the computation of the result is deferred until the
> result is needed?
>

Yes, that is exactly what it means. Executing a catenation of two arrays
will not create a new, larger array. Instead, it preserves the references
to the old arrays and creates a special CatenatedArray that acts as a
facade in front of them. Only if you actually read a value from this array
will that value (and only that value) be computed. In the catenation case,
that means simply picking the correct value from the proper underlying
array.
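
To make that concrete, here is a minimal Kotlin sketch of what such a
facade could look like. The interface APLArray and the class names are
simplifications I made up for illustration, not the actual implementation,
and it only handles rank-1 integer arrays:

    // Simplified, rank-1-only model of a lazy array.
    interface APLArray {
        val size: Int
        fun valueAt(index: Int): Long   // compute or fetch a single cell
    }

    class ConstantArray(private val values: LongArray) : APLArray {
        override val size get() = values.size
        override fun valueAt(index: Int) = values[index]
    }

    // A,B copies nothing; it just keeps references to A and B. Reading
    // a cell picks the value from whichever underlying array covers
    // that index.
    class CatenatedArray(
        private val left: APLArray,
        private val right: APLArray
    ) : APLArray {
        override val size get() = left.size + right.size
        override fun valueAt(index: Int) =
            if (index < left.size) left.valueAt(index)
            else right.valueAt(index - left.size)
    }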

> I very much appreciate your focus on practical relevance. Over time I
> have seen some proposals on the web that promised theoretical
> improvements but were easily debunked when looking at the details.
> What we should aim at is the total execution time of a real world APL
> application.
>

Absolutely. My goal here is not to find algorithms that I can optimise, but
rather to determine how real-world algorithms work with an immutable and
lazy implementation.

> +/⍳ is an example that was mentioned in the context of lazy evaluation
> and it is a very good example for what we should NOT aim at. The good
> news is that the time-complexity of +/⍳ can be reduced from
> O(N) down to O(1). Perfect. Shall we optimize it? Certainly not. The
> reason is that +/⍳ is simply a programming mistake. If a programmer
> chooses to write +/⍳N instead of, for example, (-N-N×N)÷2, then
> improving on that is only fixing obvious mistakes rather than improving
> APL. Even 9-year-old Freddy Gauß knew how to compute +/⍳ so we
> should require that an APL programmer knows it as well.
>

+/⍳ is only constant in space in my version. It's still O(N) in time. I'm
not optimising for this case specifically, but the constant space usage
comes from the fact that ⍳N never realises an actual array of N elements,
but rather simply instantiates an IotaArray containing "virtual" cells.
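
As a sketch of the same idea (again with simplified, made-up names,
rank 1 only, and assuming ⎕IO←1):

    interface APLArray {
        val size: Int
        fun valueAt(index: Int): Long
    }

    // ⍳N as a virtual array: no N-element buffer is ever allocated,
    // each cell is computed from its index on demand.
    class IotaArray(override val size: Int) : APLArray {
        override fun valueAt(index: Int) = index + 1L
    }

    // +/ still has to walk all the virtual cells, so it remains
    // O(N) in time while staying O(1) in space.
    fun sumReduce(a: APLArray): Long =
        (0 until a.size).sumOf { a.valueAt(it) }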

> By coincidence, I was myself thinking about optimizing +/⍳ some years
> ago (it is a rather low hanging fruit with an impressive speedup) but I
> then dropped it due to its lack of practical relevance.
>

And I agree with you fully. It's utterly pointless. I have considered
similar optimisations arising from the fact that my experiment uses lazy
evaluation. For example (and simplified), A+B creates a PlusArray(A, B)
which will perform the individual additions when a cell is looked up. It's
conceivable to create an optimisation that replaces PlusArray(A,
PlusArray(B, C)) with PlusArray(A, B+C) if B and C are scalars.
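
A sketch of what that rewrite could look like, using the same simplified
model as above (makePlus and ScalarArray are names I invented for the
illustration):

    interface APLArray {
        val size: Int
        fun valueAt(index: Int): Long
    }

    class ScalarArray(private val value: Long) : APLArray {
        override val size get() = 1
        override fun valueAt(index: Int) = value
    }

    // Deferred A+B: the addition for a cell happens only when that
    // cell is read. For simplicity this assumes equal sizes or a
    // scalar argument.
    class PlusArray(val left: APLArray, val right: APLArray) : APLArray {
        override val size get() = maxOf(left.size, right.size)
        override fun valueAt(index: Int) =
            left.valueAt(if (left.size == 1) 0 else index) +
            right.valueAt(if (right.size == 1) 0 else index)
    }

    // Fold PlusArray(A, PlusArray(B, C)) into PlusArray(A, B+C) when
    // B and C are scalars, so their sum is computed once rather than
    // once per cell read.
    fun makePlus(left: APLArray, right: APLArray): APLArray =
        if (right is PlusArray &&
            right.left is ScalarArray && right.right is ScalarArray)
            PlusArray(left,
                ScalarArray(right.left.valueAt(0) + right.right.valueAt(0)))
        else
            PlusArray(left, right)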

It is things like this that I would like to explore, because it's very
likely that such optimisations will not be worth it unless the arrays are
very large.

> Also, x[N]←V is invalid code since arrays cannot be modified.
>
> I suppose that modifying variables is a rather common programming
> style in APL. Not allowing it for the sake of simpler parallelization can
> easily backfire (increasing the total execution time or space).
>

True. This limitation could be eliminated with some clever refcounting, so
that it's possible to know whether an underlying array is used in a
different place. Either that, or one can collapse the operation stack prior
to assignment. This is out of scope for my experiments, however.
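
For what it's worth, the collapse approach could look something like this;
collapse and indexedAssign are hypothetical helpers I'm making up here, not
part of any actual implementation:

    interface APLArray {
        val size: Int
        fun valueAt(index: Int): Long
    }

    class ConstantArray(private val values: LongArray) : APLArray {
        override val size get() = values.size
        override fun valueAt(index: Int) = values[index]
    }

    // Realise a deferred tree into a plain buffer. Since the buffer is
    // a fresh copy, writing to it cannot be observed through any other
    // reference.
    fun collapse(a: APLArray): LongArray =
        LongArray(a.size) { a.valueAt(it) }

    // x[i]←v as collapse-then-write: mutate the private copy, then
    // rebind the name x to the copy.
    fun indexedAssign(x: APLArray, i: Int, v: Long): APLArray {
        val cells = collapse(x)
        cells[i] = v
        return ConstantArray(cells)
    }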

The reason I removed the mutability of arrays is because of this:

A←B+C
B[n]←X
(do something with A)

Since the lazy implementation hasn't evaluated the addition by the time the
assignment happens, the content of A will be affected by the result of the
assignment to B[n]. Completely eliminating modifications of arrays
conveniently removes any possibility for this to happen.
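
The hazard is easy to demonstrate with mutable cells; MutableCells below is
a hypothetical class that exists only to show the behaviour the immutability
rule is there to prevent:

    interface APLArray {
        val size: Int
        fun valueAt(index: Int): Long
    }

    // Hypothetical mutable array, present only to show the hazard.
    class MutableCells(private val cells: LongArray) : APLArray {
        override val size get() = cells.size
        override fun valueAt(index: Int) = cells[index]
        operator fun set(index: Int, value: Long) { cells[index] = value }
    }

    class PlusArray(
        private val left: APLArray,
        private val right: APLArray
    ) : APLArray {
        override val size get() = left.size
        override fun valueAt(index: Int) =
            left.valueAt(index) + right.valueAt(index)
    }

    fun main() {
        val b = MutableCells(longArrayOf(1, 2, 3))
        val c = MutableCells(longArrayOf(10, 20, 30))
        val a = PlusArray(b, c)   // A←B+C, nothing evaluated yet
        b[0] = 100                // B[1]←100
        // The addition only happens here, after the assignment, so A
        // observes the new value of B:
        println(a.valueAt(0))     // prints 110, not the expected 11
    }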

> The true challenge of parallelization is not to find faster implementations
> for specific constructs like +/⍳ but to understand the trade-offs between
> the benefits of parallel execution on the one hand and the (primarily
> run-time) cost for them on the other.
>

The reason I'm experimenting with this for the purpose of parallelisation
is to handle something like this more efficiently:

    foo bar A+B

In this case, this expression will yield a resulting array object that looks
something like this:

    FnApply {
        name: foo
        rightArg: FnApply {
            name: bar
            rightArg: PlusArray {
                leftArg: A
                rightArg: B }}}

Each time a value is read from this structure, the operation
foo(bar(plus(A,B))) is performed. In other words, I'm not doing three
different parallelisations, but rather a single one, where each thread can
perform all the operations in one go. This drastically reduces the amount
of synchronisation needed; my hypothesis is that synchronisation overhead
is a contributor to poor multi-core utilisation.
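
A sketch of the single-pass idea, using plain threads over index chunks.
Here foo and bar stand in for arbitrary scalar functions, and this is a
simplification rather than the real scheduling code:

    import kotlin.concurrent.thread

    // One fused pass: each worker computes foo(bar(a[i] + b[i])) for
    // its own slice of indices. The only synchronisation is one join
    // per thread at the end, instead of a barrier after each of the
    // three operations.
    fun fusedParallel(
        a: LongArray,
        b: LongArray,
        foo: (Long) -> Long,
        bar: (Long) -> Long,
        nThreads: Int = Runtime.getRuntime().availableProcessors()
    ): LongArray {
        require(a.size == b.size)
        val result = LongArray(a.size)
        val chunkSize = (a.size + nThreads - 1) / nThreads
        val workers = (0 until nThreads).map { t ->
            thread {
                val from = t * chunkSize
                val to = minOf(from + chunkSize, a.size)
                for (i in from until to) {
                    result[i] = foo(bar(a[i] + b[i]))
                }
            }
        }
        workers.forEach { it.join() }
        return result
    }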

Also, here is another reason to avoid mutability: If user-defined functions
can be evaluated in parallel, allowing modification of data would be quite
dangerous.

Of course, I have no idea if any of this is actually true. And even if it
is, will it make enough of a difference for real-world code? That's why I
need real examples rather than my simple test code.

> In the +/⍳ example this is easy (although useless) since +/⍳ can be
> detected statically (at ⎕FX time). However, lazy evaluation in the general
> case has far higher run-time cost than +/⍳, so it remains to be seen if
> those run-time costs can ever be amortized in a real-world application.
> Many of the examples brought up by proponents of lazy evaluation have the
> following design pattern:
>

It does indeed have higher run-time cost. My code is written in Kotlin and
runs both on the JVM and as natively compiled code. However, the native
code generation in Kotlin is terrible and not usable for a real-world test.
The JVM version runs pretty well, but since it and GNU APL are such
different things it's very hard to draw any conclusions at all.

For the sake of completeness: I've basically only timed +/⍳N for very large
values of N. The execution times for the two versions were roughly the same
(GNU APL was a few percent faster?) but the test case itself is heavily skewed
against GNU APL, since my prototype code doesn't do any allocations at all,
while GNU APL needs to fill and update a lot of memory, which will thrash
the cache.

Regards,
Elias
