Re: [numbers][gsoc] GSoC 2022 - NUMBERS-186 Proposal

Alex Herbert Sun, 12 Jun 2022 17:08:57 -0700

On Sun, 12 Jun 2022 at 16:55, Sumanth Rajkumar <rajkumar.suma...@gmail.com>
wrote:


>
>
> I have provided both approaches. Functional interfaces that operate on
> single complex and on Arrays
>

Thanks for the examples.


> <-- SNIP -->
>
>   Here the implementation including iteration/cursors has moved to the
> array based Lambda.
>   For Complex arrays, I think it is beneficial to move the iteration logic
> from data storage (ComplexList) to the Lambda  (ComplexFunctions::conj)
>

Possible mix up between lambda and method reference? I certainly would not
want to write a lambda function (anonymous method) that is responsible for
iteration; I would stick to providing a simple unary operator that can be
passed to a list to act on each member.

Re: Iteration logic in the array method. I am not sure about this. If the
underlying data is not linear (e.g. a 2D double[][]) then performing a
range based method on it may not be optimal. It should be up to the
implementation to decide the optimal iteration over the underlying data,
possibly specialising a spliterator to split based on stripes of the
underlying data.

Anyway, this set of examples for item-based or range-based approaches
confers no benefit to the range based approach. It just does a potentially
sub-optimal iteration of a unary operator on each item. But this is due to
the example not being a true range based operation. I think the range based
API should be based around operations that require access to more than one
element in the range in order to function. For example this would be
transform operations.

All the other examples we have act on single complex numbers. In this case
an operator is provided to act on the complex and create a complex result.
This can be passed to the container which will optimally iterate over the
data, or sub-range of, and apply the operation.

One issue is parallelisation using a stream. By its nature the stream acts
on single items; the results are then collated at the end. The stream does
not hold indices for each item. So you cannot use a terminal foreach()
method on each sub-range created by splitting and write back to the
original storage. The stream would have to be declared as ordered to
maintain the original order for the output and this can be collated at the
end to new storage.

To use the stream API with a parallel implementation and zero-allocation
for the terminal operation would require the stream contains an item that
allows read and write. So you can use ComplexDoubleArray:

Stream<ComplexDoubleArray> s;
Consumer<ComplexDoubleArray> op;
s.forEach(op);

Here the operator has to have an array that knows its start and length.
This would be more like a buffer. Each terminal forEach would require a
different instance of ComplexDoubleArray that holds its own start and
length. These may be accessed directly or via an iterator that provides
read and write to each element in the range. The ComplexDoubleArray can
wrap the same underlying data so the only allocation is the object in the
stream wrapping the data and a range to act on. This would be N objects
allocated equal to the number of threads.

Then you write a function that accepts a ComplexDoubleArray and writes back
to the same array. I do not see how this is any better than streaming a
mutable complex:

Stream<ComplexNumber> s;
Consumer<ComplexNumber> op;
s.forEach(op);

ComplexNumber simply has read/write for real and imaginary parts and your
operator can do what it requires. Here ComplexNumber is some object
allocated once in the forEach method that acts as a cursor over the
underlying data.

However the stream examples can be misused. The ultimate method can accept
the stream objects and use them for something else. For example the
ComplexDoubleArray covering part of the range can then be added to
(assuming the ComplexDoubleArray is also a List<Complex>). So the sub-range
should be a correct subList implementation. The ComplexNumber cursor could
be added to a List<ComplexNumber> which would have many entries but
ultimately only N actual numbers (N=parallel threads) if the ComplexNumber
was a recycled cursor object. So the specialised stream methods should not
be public.

Instead you could create your operators using factory methods. Here are
examples for single-thread and parallel implementation. The parallel
implementation uses the stream API to efficiently divide the range.

public class ComplexFunctions {
    public static ComplexDoubleUnaryOperator create(ComplexFunction<Void>
operator) {
        return new ComplexDoubleUnaryOperator() {
            @Override
            public ComplexDoubleArray apply(ComplexDoubleArray input,
ComplexDoubleArray result) {
                for (int i = 0; i < input.size(); i++) {
                    final int index = i;
                    ComplexDouble c = input.get(i);
                    operator.apply(c.real(), c.imag(), (re, im) -> {
                        result.setValue(index, re, im);
                        return null;
                    });
                }
                return result;
            }
        };
    }

    public static ComplexDoubleUnaryOperator parallel(ComplexFunction<Void>
operator) {
        return new ComplexDoubleUnaryOperator() {
            @Override
            public ComplexDoubleArray apply(ComplexDoubleArray input,
ComplexDoubleArray result) {
                IntStream.range(0, input.size()).parallel().forEach(i -> {
                    ComplexDouble c = input.get(i);
                    operator.apply(c.real(), c.imag(), (re, im) -> {
                        result.setValue(i, re, im);
                        return null;
                    });
                });
                return result;
            }
        };
    }
}

All this requires is the ComplexDoubleArray to provide efficient get/set
methods with an index.

ComplexDoubleArray a;
a.apply(ComplexFunctions.create(ComplexFunctions::conj));
a.apply(ComplexFunctions.parallel(ComplexFunctions::conj));
a.apply(ComplexFunctions2.parallel(
    (r, i, result) -> result.apply(r, 0)));

Note that here we can reuse all the ISO c99 functions that operate on a
single number. They need not be re-coded for each
ComplexDoubleUnaryOperator.

So I am seeing a requirement to have the item based API. The range based
API needs more concrete implementation examples to solidify how it should
work.



>
> What are you streaming? ComplexDoubleArray sub-arrays? Would the
> > ComplexParrallelStreamFunctions class be responsible for creating the
> > Spliterator for this to control sub-ranges?
> >
>
> Yes. The parallelization details would be abstracted by
> ComplexParrallelFunctions
>

There may be issues with streams regarding the choice of the ranges to
stream in parallel. Ideally the container should control how the stream can
be split. In my example above using a simple range split with an index in
an IntStream the split may be sub-optimal.


>
> The developer could choose explicitly to call
> list.apply(ComplexFunctions::conj) or
> list.apply(ComplexParallelFunctions::conj)
>

I am not sure we want to provide all the ISO c99 implementations in
duplicate. They are defined on Complex and should be implemented once. The
list is just a container to allow efficient storage. I do not think it
should define the ISO c99 functions in the interface.


>
>
> ComplexList.conj() could choose to apply sequential or parallel etc based
> on some config.
>
>
>
> https://github.com/sumanth-rajkumar/commons-numbers/blob/1f013bfcf7d92c4daa160284c3d6a2ec26fd6464/commons-numbers-complex/src/main/java/org/apache/commons/numbers/complex/ComplexList.java#L683
>
> If you look at how this can be used then there is a lot of excess object
> > overhead here when you wish to write code for complex number computation
> in
> > an OO way:
> > See org.apache.commons.math4.legacy.analysis.solvers.LaguerreSolver
> >
>
>  I have included both implementations as explained above.
> I have tried to avoid excess object creation in the Array approach,
>
> Even when using Array approach for single Complex instance, object creation
> overhead is same as Single number approach
>

It is unusual to have setters return a new object. This is what is required
for Complex to implement ComplexDoubleArray setters to allow the operation
to set the result. Essentially this requires the immutable final class
Complex to implement an interface which is for a mutable object. So code
reuse here is polluting the API.


> Could you please take a quick look and confirm?
> The unit test for both approaches for conj operation on ComplexList is here
>
>
> https://github.com/sumanth-rajkumar/commons-numbers/blob/1f013bfcf7d92c4daa160284c3d6a2ec26fd6464/commons-numbers-complex/src/test/java/org/apache/commons/numbers/complex/ComplexListTest.java#L94
>
> Bulk operations such as FFT would require the entire data to be operated
> > on. So this would not work in parallel streams using sub-ranges. Using a
> > 3rd party such as JTransforms would require extracting the data for their
> > API.
> >
>
> I see parallelizations in jtransform...
>
> https://github.com/wendykierp/JTransforms/blob/master/src/main/java/org/jtransforms/fft/DoubleFFT_1D.java#L688


But the DoubleFFT_1D object is created with n = size of data. This uses
precomputation for the size of arguments that will be transformed. Note
that the object is in fact thread safe and can be reused after construction
in concurrent threads. But the data passed to it must be of the correct
size. In this case you do not process in parallel using a stream. The
sub-ranges are carefully chosen and processed using a common thread pool
created by JTransforms. The parallelism is configured based on the size of
the data, below a certain size only a single thread is used.


>
> We can add more methods to ComplexDoubleArray to convert/to from for other
> providers as needed such as DoubleBuffer formats etc..
>

A utility class to map the ComplexDoubleArray to various output formats,
perhaps specified by an enum (since there are not that many possible
formats in 1D). But what of ND data? Thus far we have only considered how
to operate on a 1D linear range. Conversions would require a way to specify
the current format and the output format regarding striped (reals then
imaginary), interleaved, paired arrays, etc. for all dimensions. If ND is
represented in a linear array, this requires knowledge of the packing of
the data, e.g.: index = y * maxx + x.

There are typically such data structures in image libraries, often dealing
with 5D data, that do this that could form inspiration. Typically data is
arranged in a way that is optimal for the application and there are many
possible combinations. E.g. image data is often multiples of 2D XY data, so
may be a float[][] representing XYZ, Channel and Time. In the case of 2D
image a FFT is performed by 2 1D transforms so you extract strips of X
data, transform, then extract strips of Y from the partially transformed
XY, and repeat.

It may be that specifying a generic API for range operations on ND data may
not be possible. It would require further reading to narrow down possible
scope.


>
>
> > Note: Vector functions on a single complex number is not applicable.
> >
> >
>    Yes, in general there won't be any ComplexVectorFunctions for the single
> pair functional interface.
>    We provide ComplexVectorFunctions operations only for List based
> operation that would benefit from Vector based SIMD parallelization
>    However if we pass a List of size 1, it would still work, but possibly
> not perform as well.
>

The Vector API is for more than 1. The javadoc examples note that the
number of lanes is typically 4 or 8. Any elements remaining after
processing multiples of the lane count will be processed in a sequential
loop using non-Vector API code, i.e. the method is coded twice. So a
length=1 array is never going to use the Vector API.

Alex

Re: [numbers][gsoc] GSoC 2022 - NUMBERS-186 Proposal

Reply via email to