On 02/24/2014 05:47 PM, Gilles wrote:
> On Mon, 24 Feb 2014 14:41:34 -0500, Evan Ward wrote:
>> On 02/24/2014 01:23 PM, Gilles wrote:
>>> On Mon, 24 Feb 2014 11:49:26 -0500, Evan Ward wrote:
>>>> One way to improve performance would be to provide pre-allocated
>>>> space for the Jacobian and reuse it for each evaluation.
>>>
>>> Do you have actual data to back this statement?
>>
>> I did some tests with the CircleVectorial problem from the test
>> cases. The Jacobian is 1000x2, and I ran it 1000 times, repeating
>> until hotspot stopped making it run faster. The first column is the
>> current state of the code. The second column is with one less matrix
>> allocation per problem evaluation.
>>
>>         trunk (s)   -1 alloc (s)   %change
>> lu      0.90        0.74           -17%
>> chol    0.90        0.74           -17%
>> qr      0.87        0.70           -20%
>>
>> I also see similar reductions in runtime using 1e6 observations and
>> 3 observations. We could save 3-4 allocations per evaluation, which
>> extrapolates to a 60%-80% reduction in run time.
>
> I would not have expected such a big impact.
> How did you set up the benchmark? Actually, it would be a sane
> starting point to create a "performance" unit test (see e.g.
> "FastMathTestPerformance" in package "o.a.c.m.util").

The test I used: https://gist.github.com/wardev/3f183dfaaf297dc2462c

I also modified CircleVectorial to use the MultivariateJacobianFunction
interface to avoid the translation copies.
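In outline, the test is just a warm-up-and-time loop around evaluate().
A simplified sketch (not the exact gist code, and assuming the trunk
LeastSquaresProblem API):

    import org.apache.commons.math3.fitting.leastsquares.LeastSquaresProblem;
    import org.apache.commons.math3.linear.RealVector;

    public class EvaluationBenchmark {

        /**
         * Times batches of 1000 evaluations, repeating so that the JIT
         * can settle, and returns the best batch time in seconds.
         */
        public static double timeEvaluations(LeastSquaresProblem problem,
                                             RealVector point) {
            final int batchSize = 1000;
            double best = Double.POSITIVE_INFINITY;
            for (int rep = 0; rep < 20; rep++) {
                final long start = System.nanoTime();
                for (int i = 0; i < batchSize; i++) {
                    problem.evaluate(point); // allocations happen here
                }
                best = Math.min(best, (System.nanoTime() - start) / 1.0e9);
            }
            return best;
        }
    }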
>>>> The LeastSquaresProblem interface would then be:
>>>>
>>>> void evaluate(RealVector point, RealVector resultResiduals,
>>>>               RealVector resultJacobian);
>>>>
>>>> I'm interested in hearing your ideas on other approaches to solve
>>>> this issue. Or even if this is an issue worth solving.
>>>
>>> Not before we can be sure that in-place modification (rather than
>>> reallocation) always provides a performance benefit.
>>
>> I would like to hear other ideas for improving the performance.
>
> Design-wise, it is quite ugly to modify input parameters. I think
> that it could also hurt in the long run, by preventing other ways to
> improve performance.

I agree, it smells like Fortran. ;)

> Why couldn't the reuse vs. reallocate choice be delegated to
> implementations of the "MultivariateJacobianFunction" interface?

That is the change I made to produce the performance numbers. It works
well as long as the application has a single thread. Once it is
multi-threaded we would have to create a new Function for each thread.
I'm reluctant to do that because I think not copying the whole problem
for each thread is a nice feature.

Really, I think we need one matrix per optimization context. The
question is: how does the Function, which is several method calls below
the optimize() method, access the current optimization context? If we
reuse the matrix at the Function level then we are implicitly tying the
optimization context to the thread context. By passing the matrices as
parameters we make the context explicit, but we clutter up the API.
Perhaps there is a better solution we haven't thought of yet.
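For reference, the Function-level reuse I measured was along these
lines. A hypothetical sketch, not the exact patch; note the thread
safety caveat:

    import org.apache.commons.math3.fitting.leastsquares.MultivariateJacobianFunction;
    import org.apache.commons.math3.linear.Array2DRowRealMatrix;
    import org.apache.commons.math3.linear.ArrayRealVector;
    import org.apache.commons.math3.linear.RealMatrix;
    import org.apache.commons.math3.linear.RealVector;
    import org.apache.commons.math3.util.Pair;

    /**
     * Evaluates residuals and the Jacobian into pre-allocated storage.
     * NOT thread safe: concurrent optimizations sharing an instance
     * would overwrite each other's work space.
     */
    public abstract class ReusingFunction
            implements MultivariateJacobianFunction {

        private final RealVector residuals;
        private final RealMatrix jacobian;

        protected ReusingFunction(int observations, int parameters) {
            residuals = new ArrayRealVector(observations);
            jacobian = new Array2DRowRealMatrix(observations, parameters);
        }

        /** Overwrites the supplied storage instead of allocating. */
        protected abstract void computeInto(RealVector point,
                                            RealVector residuals,
                                            RealMatrix jacobian);

        public Pair<RealVector, RealMatrix> value(RealVector point) {
            computeInto(point, residuals, jacobian); // no new allocations
            return new Pair<RealVector, RealMatrix>(residuals, jacobian);
        }
    }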
> Eventually, doesn't it boil down to creating "RealVector" and
> "RealMatrix" implementations that modify and return "this" rather
> than create a new object?

Interesting idea. I think that would remove the allocation+copy when
applying weights. That just leaves the copies that the matrix
decompositions make of the input matrix.

Best Regards,
Evan

> Best Regards,
> Gilles

>> On 02/24/2014 12:09 PM, Luc Maisonobe wrote:
>>> Hi Evan,
>>>
>>> On 24/02/2014 17:49, Evan Ward wrote:
>>>> I've looked into improving performance further, but it seems any
>>>> further improvements will need big API changes for memory
>>>> management.
>>>>
>>>> Currently, using Gauss-Newton with Cholesky (or LU) requires 4
>>>> matrix allocations _each_ evaluation. The objective function
>>>> initially allocates the Jacobian matrix. Then the weights are
>>>> applied through matrix multiplication, allocating a new matrix.
>>>> Computing the normal equations allocates a new matrix to hold the
>>>> result, and finally the decomposition allocates its own matrix as
>>>> a copy. With QR there are 3 matrix allocations per model function
>>>> evaluation, since there is no need to compute the normal
>>>> equations, but the third allocation+copy is larger. Some empirical
>>>> sampling data I've collected with the jvisualvm tool indicates
>>>> that matrix allocation and copying takes 30% to 80% of the
>>>> execution time, depending on the dimension of the Jacobian.
>>>>
>>>> One way to improve performance would be to provide pre-allocated
>>>> space for the Jacobian and reuse it for each evaluation. The
>>>> LeastSquaresProblem interface would then be:
>>>>
>>>> void evaluate(RealVector point, RealVector resultResiduals,
>>>>               RealVector resultJacobian);
>>>>
>>>> I'm interested in hearing your ideas on other approaches to solve
>>>> this issue. Or even if this is an issue worth solving.
>>>
>>> Yes, I think this issue is worth solving, especially since we are
>>> going to ship 3.3 and need to fix as much as possible before the
>>> release, thus avoiding future problems. Everything spotted now is
>>> worth fixing now.
>>>
>>> Your approach seems reasonable, as long as the work arrays are
>>> really allocated at the start of the optimization and shared only
>>> through a few documented methods like the one you propose. This
>>> would mean we can say in the javadoc that these areas should be
>>> used only to fulfill the API requirements and not copied elsewhere,
>>> as they *will* be modified as the algorithm runs, and are
>>> explicitly devoted to avoiding reallocation. I guess this kind of
>>> problem is more important when lots of observations are performed,
>>> which corresponds to a very frequent use case (at least in the
>>> fields I know about).
>>>
>>> For the record, what you propose seems similar to what is done in
>>> the ODE package, as the state vector and its first derivatives are
>>> also kept in preallocated arrays which are reused throughout the
>>> integration and are used to exchange data between the Apache
>>> Commons Math algorithm and the user problem to be solved. So it is
>>> something we already do elsewhere.
>>
>> OK. We could keep the Evaluation interface, which would just
>> reference the pre-allocated residuals and matrix. If the result
>> parameters are null the LeastSquaresProblem could allocate storage
>> of the correct size automatically. So then the interface would look
>> like:
>>
>> Evaluation evaluate(RealVector point, RealVector resultResiduals,
>>                     RealVector resultJacobian);
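>>
>> Concretely, the null handling might default to something like the
>> following sketch (hypothetical names; the Evaluation wrapper is
>> stubbed out as a plain Pair, and the Jacobian is flattened into a
>> vector as in the signature above):
>>
>>     import org.apache.commons.math3.linear.ArrayRealVector;
>>     import org.apache.commons.math3.linear.RealVector;
>>     import org.apache.commons.math3.util.Pair;
>>
>>     public abstract class PreallocatedProblem {
>>
>>         protected abstract int getObservationSize(); // residual count
>>         protected abstract int getParameterSize();   // parameter count
>>
>>         /** Overwrites the supplied storage with residuals and Jacobian. */
>>         protected abstract void computeInto(RealVector point,
>>                                             RealVector residuals,
>>                                             RealVector jacobian);
>>
>>         /**
>>          * Callers that pass pre-allocated storage have it filled in
>>          * place; callers that pass null still get correctly sized
>>          * results.
>>          */
>>         public Pair<RealVector, RealVector> evaluate(
>>                 RealVector point,
>>                 RealVector resultResiduals,
>>                 RealVector resultJacobian) {
>>             if (resultResiduals == null) {
>>                 resultResiduals = new ArrayRealVector(getObservationSize());
>>             }
>>             if (resultJacobian == null) {
>>                 // row-major flattening of the m x n Jacobian
>>                 resultJacobian = new ArrayRealVector(
>>                         getObservationSize() * getParameterSize());
>>             }
>>             computeInto(point, resultResiduals, resultJacobian);
>>             // the real proposal would wrap these in an Evaluation
>>             return new Pair<RealVector, RealVector>(resultResiduals,
>>                                                     resultJacobian);
>>         }
>>     }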
>>> best regards,
>>> Luc

>> On 02/24/2014 01:16 PM, Gilles wrote:
>>> On Mon, 24 Feb 2014 18:09:26 +0100, Luc Maisonobe wrote:
>>>> Hi Evan,
>>>>
>>>> On 24/02/2014 17:49, Evan Ward wrote:
>>>>> I've looked into improving performance further, but it seems any
>>>>> further improvements will need big API changes for memory
>>>>> management. [...]
>>>>
>>>> Yes, I think this issue is worth solving, [...]
>>>
>>> If I understand correctly what is being discussed, I do not agree
>>> with this approach.
>>>
>>> The optimization/fitting algorithms must use matrix abstractions.
>>> If performance improvements can be achieved, they must happen at
>>> the level of the appropriate matrix implementations.
>>
>> The matrix abstractions will still be used in the interface. As far
>> as I can tell none of the optimizers or linear algebra classes use
>> the matrix abstractions internally. For example LU, QR, and Cholesky
>> all copy the matrix data to an internal double[][]. I tried
>> computing the normal equations in GaussNewton as
>> j.transpose().multiply(j), but the performance was bad because
>> j.transpose() creates a copy of the matrix. That's why we have the
>> current ugly for loop implementation with getEntry() and setEntry().
>> Maybe matrix "views" could help solve the issue.
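>>
>> The entry-wise loop follows this pattern (a sketch of the approach,
>> not the exact trunk code):
>>
>>     import org.apache.commons.math3.linear.MatrixUtils;
>>     import org.apache.commons.math3.linear.RealMatrix;
>>
>>     public class NormalEquations {
>>
>>         /** Forms N = J^T * J without materializing J^T. */
>>         public static RealMatrix normalMatrix(RealMatrix j) {
>>             final int m = j.getRowDimension();    // observations
>>             final int n = j.getColumnDimension(); // parameters
>>             final RealMatrix normal = MatrixUtils.createRealMatrix(n, n);
>>             for (int c1 = 0; c1 < n; c1++) {
>>                 for (int c2 = c1; c2 < n; c2++) {
>>                     double sum = 0;
>>                     for (int r = 0; r < m; r++) {
>>                         sum += j.getEntry(r, c1) * j.getEntry(r, c2);
>>                     }
>>                     normal.setEntry(c1, c2, sum);
>>                     normal.setEntry(c2, c1, sum); // N is symmetric
>>                 }
>>             }
>>             return normal;
>>         }
>>     }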
>>
>>> Best regards,
>>> Gilles
>>
>> Regards,
>> Evan