Hello.
On Tue, 16 Sep 2014 18:34:52 -0400, Evan and Maureen Ward wrote:
Hi Gilles, Luc,
Thanks for all the comments. I'll try to respond to the more
fundamental
concerns in this email, and the more practical ones in another email,
if we
decide that we want to include data editing in [math].
[...]
I don't see that the editing has to occur during the
optimization.
It could be independent:
1. Let the optimization run to completion with all the data
2. Compute the residuals
3a. If there are no outliers, stop
3b. If there are outliers, remove them (or modify their weight)
4. Run the optimization with the remaining data
5. Goto 2
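For illustration, steps 1-5 could be sketched like this (self-contained;
a plain straight-line least-squares fit stands in for the real
optimizer, and all names here are made up, not [math] API):

```java
import java.util.Arrays;

/** Sketch of the "optimize, then edit" loop (steps 1-5 above).
 *  A straight-line fit is a stand-in for the real optimizer;
 *  the names are illustrative, not [math] API. */
public class OutlierEditingLoop {

    /** Fit y = a + b*x by ordinary least squares; returns {a, b}. */
    static double[] fitLine(double[] x, double[] y) {
        double sx = 0, sy = 0, sxx = 0, sxy = 0;
        int n = x.length;
        for (int i = 0; i < n; i++) {
            sx += x[i]; sy += y[i];
            sxx += x[i] * x[i]; sxy += x[i] * y[i];
        }
        double b = (n * sxy - sx * sy) / (n * sxx - sx * sx);
        double a = (sy - b * sx) / n;
        return new double[] { a, b };
    }

    /** Steps 1-5: fit, compute residuals, drop points beyond k
     *  standard deviations, refit, until no outliers remain. */
    static double[] fitWithEditing(double[] x, double[] y, double k) {
        while (true) {
            double[] p = fitLine(x, y);                  // step 1 / 4
            double[] r = new double[x.length];           // step 2
            double ss = 0;
            for (int i = 0; i < x.length; i++) {
                r[i] = y[i] - (p[0] + p[1] * x[i]);
                ss += r[i] * r[i];
            }
            double sigma = Math.sqrt(ss / x.length);
            boolean[] keep = new boolean[x.length];
            int kept = 0;
            for (int i = 0; i < x.length; i++) {
                keep[i] = Math.abs(r[i]) <= k * sigma;   // step 3
                if (keep[i]) kept++;
            }
            if (kept == x.length) return p;              // step 3a
            double[] nx = new double[kept], ny = new double[kept];
            for (int i = 0, j = 0; i < x.length; i++) {
                if (keep[i]) { nx[j] = x[i]; ny[j] = y[i]; j++; }
            }
            x = nx; y = ny;                              // step 3b, goto 2
        }
    }

    public static void main(String[] args) {
        // y = 2 + 3x with one gross outlier at x = 5.
        double[] x = { 0, 1, 2, 3, 4, 5 };
        double[] y = { 2, 5, 8, 11, 14, 100 };
        double[] p = fitWithEditing(x, y, 1.5);
        System.out.println(Arrays.toString(p));  // prints [2.0, 3.0]
    }
}
```

Note that the cutoff k has to be chosen with care: with a single gross
outlier the first fit is dragged towards it, so a too-large k can mask
the outlier entirely.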
The advantage I see here is that the weight modification is a user
decision (and implementation).
However, IIUC the examples from the link provided by Evan, "robust"
optimization (i.e. handling outliers during optimization) could lead
to a "better" solution.
Would it always be "better"? Not sure: significant points could be
mistaken as outliers and be discarded before the optimizer could
figure out the correct solution...
To avoid a "good but wrong" fit, I'm afraid that we'd have to
introduce several parameters (like "do not discard outliers before
iteration n", "do not discard more than m points", etc.)
The tuning of those parameters will probably not be obvious, and
they will surely complicate the code (e.g. input data like the
"target" and "weight" arrays won't be "final").
The advantage I see is not correctness (since the algorithm you
outline
will converge correctly), but reducing function evaluations. (I don't
have
data to back up this assertion.) Without inline data editing, the
optimization algorithm would "waste" the evaluations between when the
outliers become obvious and when the optimization converges. With the
"inline" scheme, the outliers are deleted as soon as possible, and the
remaining evaluations are used to converge towards the correct
solution.
Converging to a "good but wrong" fit will always be a risk of any
algorithm that automatically throws away data. As with our other
algorithms, I'm expecting the user to know when the algorithm is a
good fit for their problem. The use case I see is when the
observations contain mostly real data, and a few random numbers. The
bad observations can be hard to identify a priori, but become obvious
during the fitting process.
[...]
What works in one domain might not in another.
Thus, the feature should not alter the "standardness", nor decrease
the robustness of the implementation. Caring for special cases
(for which the feature is useful) may be achieved by e.g. using the
standard algorithm as a building block that is called repeatedly, as
hinted above (and tuning the standard parameters and input data
appropriately for each call).
I was not expecting the response that [math] may not want this
feature. I'm
o.k. with this result since I can implement it as an addition to
[math],
though the API won't be as clean.
IMHO, we cannot assume that fiddling with some of the data points while
the optimization progresses won't alter the correctness of the
solution.
I think that when points are deemed "outliers", e.g. using external
knowledge not available to the optimizer, they should be removed, and
the optimization redone on the "real" data.
As I understand it, "robust" optimization is not really a good name
(I'd
suggest "fuzzy", or something) because it will indeed assign less
weight
to data points solely on the basis that they are less represented among
the currently available data, irrespective of whether they actually
pertain to the phenomenon being measured.
At first sight, this could allow the optimizer to drift farther and
farther away from the correct solution (if those points were not
outliers).
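For what it's worth, the down-weighting scheme discussed here is
essentially iteratively reweighted least squares. Below is a
self-contained, hypothetical sketch: a straight-line fit stands in for
the optimizer, and the Huber-style weight, the robust scale estimate
and the constant c are my choices, not anything from [math]:

```java
import java.util.Arrays;

/** Hypothetical sketch of the down-weighting ("robust") approach:
 *  iteratively reweighted least squares with Huber-style weights.
 *  None of these names come from [math]. */
public class HuberReweighting {

    /** Weighted least-squares fit of y = a + b*x; returns {a, b}. */
    static double[] weightedFitLine(double[] x, double[] y, double[] w) {
        double sw = 0, sx = 0, sy = 0, sxx = 0, sxy = 0;
        for (int i = 0; i < x.length; i++) {
            sw += w[i];
            sx += w[i] * x[i];
            sy += w[i] * y[i];
            sxx += w[i] * x[i] * x[i];
            sxy += w[i] * x[i] * y[i];
        }
        double b = (sw * sxy - sx * sy) / (sw * sxx - sx * sx);
        double a = (sy - b * sx) / sw;
        return new double[] { a, b };
    }

    /** IRLS: repeatedly refit, giving weight c*sigma/|r| (capped at 1)
     *  to points with large residuals instead of removing them. */
    static double[] robustFit(double[] x, double[] y, double c, int iterations) {
        double[] w = new double[x.length];
        Arrays.fill(w, 1.0);
        double[] p = weightedFitLine(x, y, w);
        for (int it = 0; it < iterations; it++) {
            double[] absR = new double[x.length];
            for (int i = 0; i < x.length; i++) {
                absR[i] = Math.abs(y[i] - (p[0] + p[1] * x[i]));
            }
            double[] sorted = absR.clone();
            Arrays.sort(sorted);
            int n = sorted.length;
            double median = n % 2 == 1 ? sorted[n / 2]
                    : 0.5 * (sorted[n / 2 - 1] + sorted[n / 2]);
            // Robust scale from the median absolute residual.
            double sigma = Math.max(1.4826 * median, 1e-12);
            for (int i = 0; i < x.length; i++) {
                w[i] = absR[i] <= c * sigma ? 1.0 : c * sigma / absR[i];
            }
            p = weightedFitLine(x, y, w);
        }
        return p;
    }

    public static void main(String[] args) {
        double[] x = { 0, 1, 2, 3, 4, 5 };
        double[] y = { 2, 5, 8, 11, 14, 100 };  // gross outlier at x = 5
        double[] p = robustFit(x, y, 1.0, 100);
        System.out.println(Arrays.toString(p));  // approximately [2.0, 3.0]
    }
}
```

The outlier is never removed, only given a vanishing weight, so the
"target" and "weight" arrays keep their size; this is the trade-off
against explicit removal discussed above.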
At first sight, I'd avoid modification of the sizes of input data
(option 2);
from an API usage viewpoint, I imagine that user code will require
additional
"length" tests.
Couldn't the problem you mention in option 1 disappear by having
different
methods that return the a-priori weights and the modified weights?
As we are already much too stringent with our compatibility policy, I
would allow the case where only *very* advanced users would have
problems. So it would seem fair to me if we can make some changes
where users of the factory are unaffected, and expert users who
decided the factory was not good for them have to put in extra
effort.
Before embarking on this, I would like to see examples where the
"inline" outlier rejection leads to a (correct) solution impossible to
achieve with the approach I've suggested.
As discussed above, I don't see any cases where "inline" outlier
rejection
will result in a less correct solution than the algorithm you
outline.
I do see a potential case, in my current work. ;-)
I do
think we can save a significant number of function evaluations by
using
"inline" outlier rejection.
We can of course talk about a performance-correctness trade-off,
perhaps
useful for cases where the risk is low (lots of data, known expected
rate of
outliers).
IIUC the reference you provided, it seems that we only need a hook to
allow "outside" modification of the weights (?).
Could it be provided with an interface like the following:
public interface WeightValidator {
    /**
     * @param weights Current weights.
     * @param residuals Current residuals.
     * @return the adjusted weights.
     */
    RealVector validateWeights(RealVector weights,
                               RealVector residuals);
}
(similar to the suggestion for MATH-1144)?
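A toy implementation of such a hook could look like the following.
This is an array-based analogue (double[] instead of RealVector, so
the sketch stands alone without [math] on the classpath), and the
rms-cutoff policy is only one possible user-supplied strategy:

```java
/** Array-based analogue of the proposed WeightValidator hook
 *  (double[] instead of RealVector so the sketch is self-contained).
 *  Names and the cutoff policy are illustrative, not [math] API. */
public class WeightValidatorSketch {

    interface WeightValidator {
        /** Returns adjusted weights given current weights and residuals. */
        double[] validateWeights(double[] weights, double[] residuals);
    }

    /** Zeroes the weight of any point whose |residual| exceeds
     *  cutoff times the RMS residual. */
    static WeightValidator rmsCutoff(final double cutoff) {
        return new WeightValidator() {
            public double[] validateWeights(double[] weights, double[] residuals) {
                double ss = 0;
                for (double r : residuals) ss += r * r;
                double rms = Math.sqrt(ss / residuals.length);
                double[] adjusted = weights.clone();
                for (int i = 0; i < residuals.length; i++) {
                    if (Math.abs(residuals[i]) > cutoff * rms) {
                        adjusted[i] = 0.0;  // reject the outlier
                    }
                }
                return adjusted;
            }
        };
    }

    public static void main(String[] args) {
        double[] w = { 1, 1, 1, 1 };
        double[] r = { 0.1, -0.2, 0.1, 50.0 };  // last point is a gross outlier
        double[] adjusted = rmsCutoff(1.5).validateWeights(w, r);
        System.out.println(java.util.Arrays.toString(adjusted));
        // prints [1.0, 1.0, 1.0, 0.0]
    }
}
```

The optimizer would call the hook between iterations; the weight
modification stays a user decision (and implementation), as suggested
earlier in the thread.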
Best regards,
Gilles
Best Regards,
Evan
[...]
[1] http://markmail.org/message/e53nago3swvu3t52
https://issues.apache.org/jira/browse/MATH-1105
[2] http://www.mathworks.com/help/curvefit/removing-outliers.html
http://www.mathworks.com/help/curvefit/least-squares-fitting.html