Re: [numbers][gsoc] GSoC 2022 - NUMBERS-186 Proposal

Sumanth Rajkumar Sun, 12 Jun 2022 23:47:01 -0700

Thanks Alex for the detailed feedback.

As GSOC Phase 1 has started, for the existing C99 complex functions in
Complex class, I will refactor using the item based functional interface
approach.


I can look at a Range based approach later if I have time.
As you mentioned the C99 functions cannot take advantage of the limited
Java 16 Vector API features.
However, it may be possible to speed up C99 functions using GPUs based on
quick look at documentation here

https://aparapi.com/introduction/getting-started.html
https://aparapi.com/documentation/kernel-guidelines.html
https://aparapi.com/documentation/aparapi-patterns.html
https://git.qoto.org/aparapi/aparapi-examples/-/blob/master/examples/movie/src/com/amd/aparapi/examples/movie/AparapiSolution.java


For Phase 1, I propose to do the following


1) Introduce ComplexDouble, ComplexDoubleVector and ComplexDoubleMatrix
interfaces
    The interfaces to have methods for applying below unary and binary
double functions
    The interfaces to have methods to convert to/from primitive double
arrays (both separate arrays for real/imaginary and single interleaved
array)
    The interfaces to have methods to convert to/from DoubleBuffer (both
separate arrays for real/imaginary and single interleaved buffer)

2) Introduce generic (item based) functional interfaces for unary and
binary double functions

@FunctionalInterface
public interface ComplexDoubleFunction<R> {

R apply(double real, double imaginary, ComplexDoubleResult<R> result);
}

@FunctionalInterface
public interface ComplexDoubleBiFunction<R> {

R apply(double real1, double imaginary1, double real2, double imaginary2,
 ComplexDoubleResult<R> result);
}

@FunctionalInterface
public interface ComplexDoubleResult<R> {

R apply(double r, double i);

}



3) Add ComplexDouble interface to Complex class.
   Add ComplexDoubleVectorImpl (primitive double array backing)
implementation for ComplexDoubleVector interface
   Add ComplexDoubleMatrixImpl (primitive double array backing)
implementation for ComplexDoubleMatrix interface

4) Refactor existing instance methods in Complex class as static functions
in ComplexDoubleFunctions class using functional interface signatures


5) Add concurrency support for operations on Vector and Matrix objects


interface ComplexDoubleVector {

      default ComplexDoubleVector apply(ComplexDoubleFunction<Void>
function) {
return apply(function, 0, size(), 1);
 }

     default ComplexDoubleVector apply(ComplexDoubleFunction<Void>
function, int concurrency) {
         return apply(function, 0, size(), 1);
 }

    ComplexDoubleVector apply(ComplexDoubleFunction<Void> function, int
start, int length, int concurrency)
}

ComplexDoubleMatrixImpl and ComplexDoubleVectorImpl implement concurrency
(if concurrency > 1) by partitioning the data and execute using fork join
thread pool (No use of streams)


6) Unit tests for Vector and Matrix implementations

Let me know if you need anything else or reduce the scope for phase 1

Thanks
Sumanth

On Sun, 12 Jun 2022 at 20:08, Alex Herbert <alex.d.herb...@gmail.com> wrote:

> On Sun, 12 Jun 2022 at 16:55, Sumanth Rajkumar <rajkumar.suma...@gmail.com
> >
> wrote:
>
> >
> >
> > I have provided both approaches. Functional interfaces that operate on
> > single complex and on Arrays
> >
>
> Thanks for the examples.
>
>
> > <-- SNIP -->
> >
> >   Here the implementation including iteration/cursors has moved to the
> > array based Lambda.
> >   For Complex arrays, I think it is beneficial to move the iteration
> logic
> > from data storage (ComplexList) to the Lambda  (ComplexFunctions::conj)
> >
>
> Possible mix up between lambda and method reference? I certainly would not
> want to write a lambda function (anonymous method) that is responsible for
> iteration; I would stick to providing a simple unary operator that can be
> passed to a list to act on each member.
>
> Re: Iteration logic in the array method. I am not sure about this. If the
> underlying data is not linear (e.g. a 2D double[][]) then performing a
> range based method on it may not be optimal. It should be up to the
> implementation to decide the optimal iteration over the underlying data,
> possibly specialising a spliterator to split based on stripes of the
> underlying data.
>
> Anyway, this set of examples for item-based or range-based approaches
> confers no benefit to the range based approach. It just does a potentially
> sub-optimal iteration of a unary operator on each item. But this is due to
> the example not being a true range based operation. I think the range based
> API should be based around operations that require access to more than one
> element in the range in order to function. For example this would be
> transform operations.
>
> All the other examples we have act on single complex numbers. In this case
> an operator is provided to act on the complex and create a complex result.
> This can be passed to the container which will optimally iterate over the
> data, or sub-range of, and apply the operation.
>
> One issue is parallelisation using a stream. By its nature the stream acts
> on single items; the results are then collated at the end. The stream does
> not hold indices for each item. So you cannot use a terminal foreach()
> method on each sub-range created by splitting and write back to the
> original storage. The stream would have to be declared as ordered to
> maintain the original order for the output and this can be collated at the
> end to new storage.
>
> To use the stream API with a parallel implementation and zero-allocation
> for the terminal operation would require the stream contains an item that
> allows read and write. So you can use ComplexDoubleArray:
>
> Stream<ComplexDoubleArray> s;
> Consumer<ComplexDoubleArray> op;
> s.forEach(op);
>
> Here the operator has to have an array that knows its start and length.
> This would be more like a buffer. Each terminal forEach would require a
> different instance of ComplexDoubleArray that holds its own start and
> length. These may be accessed directly or via an iterator that provides
> read and write to each element in the range. The ComplexDoubleArray can
> wrap the same underlying data so the only allocation is the object in the
> stream wrapping the data and a range to act on. This would be N objects
> allocated equal to the number of threads.
>
> Then you write a function that accepts a ComplexDoubleArray and writes back
> to the same array. I do not see how this is any better than streaming a
> mutable complex:
>
> Stream<ComplexNumber> s;
> Consumer<ComplexNumber> op;
> s.forEach(op);
>
> ComplexNumber simply has read/write for real and imaginary parts and your
> operator can do what it requires. Here ComplexNumber is some object
> allocated once in the forEach method that acts as a cursor over the
> underlying data.
>
> However the stream examples can be misused. The ultimate method can accept
> the stream objects and use them for something else. For example the
> ComplexDoubleArray covering part of the range can then be added to
> (assuming the ComplexDoubleArray is also a List<Complex>). So the sub-range
> should be a correct subList implementation. The ComplexNumber cursor could
> be added to a List<ComplexNumber> which would have many entries but
> ultimately only N actual numbers (N=parallel threads) if the ComplexNumber
> was a recycled cursor object. So the specialised stream methods should not
> be public.
>
> Instead you could create your operators using factory methods. Here are
> examples for single-thread and parallel implementation. The parallel
> implementation uses the stream API to efficiently divide the range.
>
> public class ComplexFunctions {
>     public static ComplexDoubleUnaryOperator create(ComplexFunction<Void>
> operator) {
>         return new ComplexDoubleUnaryOperator() {
>             @Override
>             public ComplexDoubleArray apply(ComplexDoubleArray input,
> ComplexDoubleArray result) {
>                 for (int i = 0; i < input.size(); i++) {
>                     final int index = i;
>                     ComplexDouble c = input.get(i);
>                     operator.apply(c.real(), c.imag(), (re, im) -> {
>                         result.setValue(index, re, im);
>                         return null;
>                     });
>                 }
>                 return result;
>             }
>         };
>     }
>
>     public static ComplexDoubleUnaryOperator parallel(ComplexFunction<Void>
> operator) {
>         return new ComplexDoubleUnaryOperator() {
>             @Override
>             public ComplexDoubleArray apply(ComplexDoubleArray input,
> ComplexDoubleArray result) {
>                 IntStream.range(0, input.size()).parallel().forEach(i -> {
>                     ComplexDouble c = input.get(i);
>                     operator.apply(c.real(), c.imag(), (re, im) -> {
>                         result.setValue(i, re, im);
>                         return null;
>                     });
>                 });
>                 return result;
>             }
>         };
>     }
> }
>
> All this requires is the ComplexDoubleArray to provide efficient get/set
> methods with an index.
>
> ComplexDoubleArray a;
> a.apply(ComplexFunctions.create(ComplexFunctions::conj));
> a.apply(ComplexFunctions.parallel(ComplexFunctions::conj));
> a.apply(ComplexFunctions2.parallel(
>     (r, i, result) -> result.apply(r, 0)));
>
> Note that here we can reuse all the ISO c99 functions that operate on a
> single number. They need not be re-coded for each
> ComplexDoubleUnaryOperator.
>
> So I am seeing a requirement to have the item based API. The range based
> API needs more concrete implementation examples to solidify how it should
> work.
>
>
>
> >
> > What are you streaming? ComplexDoubleArray sub-arrays? Would the
> > > ComplexParrallelStreamFunctions class be responsible for creating the
> > > Spliterator for this to control sub-ranges?
> > >
> >
> > Yes. The parallelization details would be abstracted by
> > ComplexParrallelFunctions
> >
>
> There may be issues with streams regarding the choice of the ranges to
> stream in parallel. Ideally the container should control how the stream can
> be split. In my example above using a simple range split with an index in
> an IntStream the split may be sub-optimal.
>
>
> >
> > The developer could choose explicitly to call
> > list.apply(ComplexFunctions::conj) or
> > list.apply(ComplexParallelFunctions::conj)
> >
>
> I am not sure we want to provide all the ISO c99 implementations in
> duplicate. They are defined on Complex and should be implemented once. The
> list is just a container to allow efficient storage. I do not think it
> should define the ISO c99 functions in the interface.
>
>
> >
> >
> > ComplexList.conj() could choose to apply sequential or parallel etc based
> > on some config.
> >
> >
> >
> >
> https://github.com/sumanth-rajkumar/commons-numbers/blob/1f013bfcf7d92c4daa160284c3d6a2ec26fd6464/commons-numbers-complex/src/main/java/org/apache/commons/numbers/complex/ComplexList.java#L683
> >
> > If you look at how this can be used then there is a lot of excess object
> > > overhead here when you wish to write code for complex number
> computation
> > in
> > > an OO way:
> > > See org.apache.commons.math4.legacy.analysis.solvers.LaguerreSolver
> > >
> >
> >  I have included both implementations as explained above.
> > I have tried to avoid excess object creation in the Array approach,
> >
> > Even when using Array approach for single Complex instance, object
> creation
> > overhead is same as Single number approach
> >
>
> It is unusual to have setters return a new object. This is what is required
> for Complex to implement ComplexDoubleArray setters to allow the operation
> to set the result. Essentially this requires the immutable final class
> Complex to implement an interface which is for a mutable object. So code
> reuse here is polluting the API.
>
>
> > Could you please take a quick look and confirm?
> > The unit test for both approaches for conj operation on ComplexList is
> here
> >
> >
> >
> https://github.com/sumanth-rajkumar/commons-numbers/blob/1f013bfcf7d92c4daa160284c3d6a2ec26fd6464/commons-numbers-complex/src/test/java/org/apache/commons/numbers/complex/ComplexListTest.java#L94
> >
> > Bulk operations such as FFT would require the entire data to be operated
> > > on. So this would not work in parallel streams using sub-ranges. Using
> a
> > > 3rd party such as JTransforms would require extracting the data for
> their
> > > API.
> > >
> >
> > I see parallelizations in jtransform...
> >
> >
> https://github.com/wendykierp/JTransforms/blob/master/src/main/java/org/jtransforms/fft/DoubleFFT_1D.java#L688
>
>
> But the DoubleFFT_1D object is created with n = size of data. This uses
> precomputation for the size of arguments that will be transformed. Note
> that the object is in fact thread safe and can be reused after construction
> in concurrent threads. But the data passed to it must be of the correct
> size. In this case you do not process in parallel using a stream. The
> sub-ranges are carefully chosen and processed using a common thread pool
> created by JTransforms. The parallelism is configured based on the size of
> the data, below a certain size only a single thread is used.
>
>
> >
> > We can add more methods to ComplexDoubleArray to convert/to from for
> other
> > providers as needed such as DoubleBuffer formats etc..
> >
>
> A utility class to map the ComplexDoubleArray to various output formats,
> perhaps specified by an enum (since there are not that many possible
> formats in 1D). But what of ND data? Thus far we have only considered how
> to operate on a 1D linear range. Conversions would require a way to specify
> the current format and the output format regarding striped (reals then
> imaginary), interleaved, paired arrays, etc. for all dimensions. If ND is
> represented in a linear array, this requires knowledge of the packing of
> the data, e.g.: index = y * maxx + x.
>
> There are typically such data structures in image libraries, often dealing
> with 5D data, that do this that could form inspiration. Typically data is
> arranged in a way that is optimal for the application and there are many
> possible combinations. E.g. image data is often multiples of 2D XY data, so
> may be a float[][] representing XYZ, Channel and Time. In the case of 2D
> image a FFT is performed by 2 1D transforms so you extract strips of X
> data, transform, then extract strips of Y from the partially transformed
> XY, and repeat.
>
> It may be that specifying a generic API for range operations on ND data may
> not be possible. It would require further reading to narrow down possible
> scope.
>
>
> >
> >
> > > Note: Vector functions on a single complex number is not applicable.
> > >
> > >
> >    Yes, in general there won't be any ComplexVectorFunctions for the
> single
> > pair functional interface.
> >    We provide ComplexVectorFunctions operations only for List based
> > operation that would benefit from Vector based SIMD parallelization
> >    However if we pass a List of size 1, it would still work, but possibly
> > not perform as well.
> >
>
> The Vector API is for more than 1. The javadoc examples note that the
> number of lanes is typically 4 or 8. Any elements remaining after
> processing multiples of the lane count will be processed in a sequential
> loop using non-Vector API code, i.e. the method is coded twice. So a
> length=1 array is never going to use the Vector API.
>
> Alex
>

Re: [numbers][gsoc] GSoC 2022 - NUMBERS-186 Proposal

Reply via email to