Hello.
On Tue, 16 Sep 2014 18:34:52 -0400, Evan and Maureen Ward wrote:
Hi Gilles, Luc,
Thanks for all the comments. I'll try to respond to the more
fundamental
concerns in this email, and the more practical ones in another email,
if we
decide that we want to include data editing in [math].
[...]
I don't see that the editing has to occur during the
optimization.
It could be independent:
1. Let the optimization run to completion with all the data
2. Compute the residuals
3a. If there are no outliers, stop
3b. If there are outliers, remove them (or modify their weight)
4. Run the optimization with the remaining data
5. Goto 2
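For illustration, steps 1-5 could be sketched like this (self-contained;
a plain straight-line least-squares fit stands in for the real
optimizer, and all names here are made up, not [math] API):

```java
import java.util.Arrays;

/** Sketch of the "optimize, then edit" loop (steps 1-5 above).
 *  A straight-line fit is a stand-in for the real optimizer;
 *  the names are illustrative, not [math] API. */
public class OutlierEditingLoop {

    /** Fit y = a + b*x by ordinary least squares; returns {a, b}. */
    static double[] fitLine(double[] x, double[] y) {
        double sx = 0, sy = 0, sxx = 0, sxy = 0;
        int n = x.length;
        for (int i = 0; i < n; i++) {
            sx += x[i]; sy += y[i];
            sxx += x[i] * x[i]; sxy += x[i] * y[i];
        }
        double b = (n * sxy - sx * sy) / (n * sxx - sx * sx);
        double a = (sy - b * sx) / n;
        return new double[] { a, b };
    }

    /** Steps 1-5: fit, compute residuals, drop points beyond k
     *  standard deviations, refit, until no outliers remain. */
    static double[] fitWithEditing(double[] x, double[] y, double k) {
        while (true) {
            double[] p = fitLine(x, y);                  // step 1 / 4
            double[] r = new double[x.length];           // step 2
            double ss = 0;
            for (int i = 0; i < x.length; i++) {
                r[i] = y[i] - (p[0] + p[1] * x[i]);
                ss += r[i] * r[i];
            }
            double sigma = Math.sqrt(ss / x.length);
            boolean[] keep = new boolean[x.length];
            int kept = 0;
            for (int i = 0; i < x.length; i++) {
                keep[i] = Math.abs(r[i]) <= k * sigma;   // step 3
                if (keep[i]) kept++;
            }
            if (kept == x.length) return p;              // step 3a
            double[] nx = new double[kept], ny = new double[kept];
            for (int i = 0, j = 0; i < x.length; i++) {
                if (keep[i]) { nx[j] = x[i]; ny[j] = y[i]; j++; }
            }
            x = nx; y = ny;                              // step 3b, goto 2
        }
    }

    public static void main(String[] args) {
        // y = 2 + 3x with one gross outlier at x = 5.
        double[] x = { 0, 1, 2, 3, 4, 5 };
        double[] y = { 2, 5, 8, 11, 14, 100 };
        double[] p = fitWithEditing(x, y, 1.5);
        System.out.println(Arrays.toString(p));  // prints [2.0, 3.0]
    }
}
```

Note that the cutoff k has to be chosen with care: with a single gross
outlier the first fit is dragged towards it, so a too-large k can mask
the outlier entirely.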
The advantage I see here is that the weight modification is a user
decision (and implementation).
However, IIUC the examples from the link provided by Evan, "robust"
optimization (i.e. handling outliers during optimization) could lead
to a "better" solution.
Would it always be "better"? Not sure: significant points could be
mistaken as outliers and be discarded before the optimizer could
figure out the correct solution...
To avoid a "good but wrong" fit, I'm afraid that we'd have to
introduce several parameters (like "do not discard outliers before
iteration n", "do not discard more than m points", etc.)
The tuning of those parameters will probably not be obvious, and
they will surely complicate the code (e.g. input data like the
"target" and "weight" arrays won't be "final").
The advantage I see is not correctness (since the algorithm you
outline
will converge correctly), but reducing function evaluations. (I don't
have
data to back up this assertion.) Without inline data editing, the
optimization algorithm would "waste" the evaluations between when the
outliers become obvious and when the optimization converges. With the
"inline" scheme, the outliers are deleted as soon as possible, and the
remaining evaluations are used to converge towards the correct
solution.
Converging to a "good but wrong" fit will always be a risk of any
algorithm that automatically throws away data. As with our other
algorithms, I'm expecting the user to know when the algorithm is a
good fit for their problem. The use case I see is when the
observations contain mostly real data, and a few random numbers. The
bad observations can be hard to identify a priori, but become obvious
during the fitting process.
[...]
What works in one domain might not in another.
Thus, the feature should not alter the "standardness", nor decrease
the robustness of the implementation. Caring for special cases
(for which the feature is useful) may be achieved by e.g. using the
standard algorithm as a building block that is called repeatedly, as
hinted above (and tuning the standard parameters and input data
appropriately for each call).
I was not expecting the response that [math] may not want this
feature. I'm
o.k. with this result since I can implement it as an addition to
[math],
though the API won't be as clean.
IMHO, we cannot assume that fiddling with some of the data points while
the optimization progresses won't alter the correctness of the
solution.
I think that when points are deemed "outliers", e.g. using external
knowledge not available to the optimizer, they should be removed, and
the optimization redone on the "real" data.
As I understand it, "robust" optimization is not really a good name
(I'd
suggest "fuzzy", or something) because it will indeed assign less
weight
to data points solely on the basis that they are less represented among
the currently available data, irrespective of whether they actually
pertain to the phenomenon being measured.
At first sight, this could allow the optimizer to drift farther and
farther away from the correct solution (if those points were not
outliers).
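For what it's worth, the down-weighting scheme discussed here is
essentially iteratively reweighted least squares. Below is a
self-contained, hypothetical sketch: a straight-line fit stands in for
the optimizer, and the Huber-style weight, the robust scale estimate
and the constant c are my choices, not anything from [math]:

```java
import java.util.Arrays;

/** Hypothetical sketch of the down-weighting ("robust") approach:
 *  iteratively reweighted least squares with Huber-style weights.
 *  None of these names come from [math]. */
public class HuberReweighting {

    /** Weighted least-squares fit of y = a + b*x; returns {a, b}. */
    static double[] weightedFitLine(double[] x, double[] y, double[] w) {
        double sw = 0, sx = 0, sy = 0, sxx = 0, sxy = 0;
        for (int i = 0; i < x.length; i++) {
            sw += w[i];
            sx += w[i] * x[i];
            sy += w[i] * y[i];
            sxx += w[i] * x[i] * x[i];
            sxy += w[i] * x[i] * y[i];
        }
        double b = (sw * sxy - sx * sy) / (sw * sxx - sx * sx);
        double a = (sy - b * sx) / sw;
        return new double[] { a, b };
    }

    /** IRLS: repeatedly refit, giving weight c*sigma/|r| (capped at 1)
     *  to points with large residuals instead of removing them. */
    static double[] robustFit(double[] x, double[] y, double c, int iterations) {
        double[] w = new double[x.length];
        Arrays.fill(w, 1.0);
        double[] p = weightedFitLine(x, y, w);
        for (int it = 0; it < iterations; it++) {
            double[] absR = new double[x.length];
            for (int i = 0; i < x.length; i++) {
                absR[i] = Math.abs(y[i] - (p[0] + p[1] * x[i]));
            }
            double[] sorted = absR.clone();
            Arrays.sort(sorted);
            int n = sorted.length;
            double median = n % 2 == 1 ? sorted[n / 2]
                    : 0.5 * (sorted[n / 2 - 1] + sorted[n / 2]);
            // Robust scale from the median absolute residual.
            double sigma = Math.max(1.4826 * median, 1e-12);
            for (int i = 0; i < x.length; i++) {
                w[i] = absR[i] <= c * sigma ? 1.0 : c * sigma / absR[i];
            }
            p = weightedFitLine(x, y, w);
        }
        return p;
    }

    public static void main(String[] args) {
        double[] x = { 0, 1, 2, 3, 4, 5 };
        double[] y = { 2, 5, 8, 11, 14, 100 };  // gross outlier at x = 5
        double[] p = robustFit(x, y, 1.0, 100);
        System.out.println(Arrays.toString(p));  // approximately [2.0, 3.0]
    }
}
```

The outlier is never removed, only given a vanishing weight, so the
"target" and "weight" arrays keep their size; this is the trade-off
against explicit removal discussed above.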
At first sight, I'd avoid modification of the sizes of input data
(option 2);
from an API usage viewpoint, I imagine that user code will require
additional
"length" tests.
Couldn't the problem you mention in option 1 disappear by having
different
methods that return the a-priori weights and the modified weights?
As we are already much too stringent with our compatibility policy, I
would allow the case where only *very* advanced users would have
problems. So it would seem fair to me if we can make some changes
where users of the factory are unaffected, and expert users who
decided the factory was not good for them have to put in extra
effort.
Before embarking on this, I would like to see examples where the
"inline" outlier rejection leads to a (correct) solution impossible to
achieve with the approach I've suggested.
As discussed above, I don't see any cases where "inline" outlier
rejection
will result in a less correct solution than the algorithm you
outline.
I do see a potential case, in my current work. ;-)
I do
think we can save a significant number of function evaluations by
using
"inline" outlier rejection.
We can of course talk about a performance-correctness trade-off,
perhaps
useful for cases where the risk is low (lots of data, known expected
rate of
outliers).
IIUC the reference you provided, it seems that we only need a hook to
allow "outside" modification of the weights (?).
Could it be provided with an interface like the following:
public interface WeightValidator {
    /**
     * @param weights Current weights.
     * @param residuals Current residuals.
     * @return the adjusted weights.
     */
    RealVector validateWeights(RealVector weights,
                               RealVector residuals);
}
(similar to the suggestion for MATH-1144)?
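A toy implementation of such a hook could look like the following.
This is an array-based analogue (double[] instead of RealVector, so
the sketch stands alone without [math] on the classpath), and the
rms-cutoff policy is only one possible user-supplied strategy:

```java
/** Array-based analogue of the proposed WeightValidator hook
 *  (double[] instead of RealVector so the sketch is self-contained).
 *  Names and the cutoff policy are illustrative, not [math] API. */
public class WeightValidatorSketch {

    interface WeightValidator {
        /** Returns adjusted weights given current weights and residuals. */
        double[] validateWeights(double[] weights, double[] residuals);
    }

    /** Zeroes the weight of any point whose |residual| exceeds
     *  cutoff times the RMS residual. */
    static WeightValidator rmsCutoff(final double cutoff) {
        return new WeightValidator() {
            public double[] validateWeights(double[] weights, double[] residuals) {
                double ss = 0;
                for (double r : residuals) ss += r * r;
                double rms = Math.sqrt(ss / residuals.length);
                double[] adjusted = weights.clone();
                for (int i = 0; i < residuals.length; i++) {
                    if (Math.abs(residuals[i]) > cutoff * rms) {
                        adjusted[i] = 0.0;  // reject the outlier
                    }
                }
                return adjusted;
            }
        };
    }

    public static void main(String[] args) {
        double[] w = { 1, 1, 1, 1 };
        double[] r = { 0.1, -0.2, 0.1, 50.0 };  // last point is a gross outlier
        double[] adjusted = rmsCutoff(1.5).validateWeights(w, r);
        System.out.println(java.util.Arrays.toString(adjusted));
        // prints [1.0, 1.0, 1.0, 0.0]
    }
}
```

The optimizer would call the hook between iterations; the weight
modification stays a user decision (and implementation), as suggested
earlier in the thread.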
Best regards,
Gilles
Best Regards,
Evan
[...]
[1] http://markmail.org/message/e53nago3swvu3t52
https://issues.apache.org/jira/browse/MATH-1105
[2] http://www.mathworks.com/help/curvefit/removing-outliers.html
http://www.mathworks.com/help/curvefit/least-squares-fitting.html