On 8/31/11 10:05 PM, Greg Sterijevski wrote: > Hello All, > > This question popped into my head this evening, what is the right way to > handle multiple algorithms which purport to calculate the same thing? There > are, for example, a couple of ways to calculate the student t cdf. What is > the common's philosophy on deciding: > > 1. Whether to allow multiple algorithms. > 2. How an algorithm is included. > a.) Does a 'bug' or shortcoming need to be shown in the current > implementation? > b.) Say that algorithm a works for a numerical range and b works best on > another. Are both included? Is a new 'meta' algorithm written which mixes > both a and b? > 3. Does simplicity count? > 4. Does speed matter? > > A while back, Chris Nix reimplemented the SVD routine. I am not sure I > remember the old routine so I cannot say there was anything worth keeping > there. However was there a conscious decision to scrap it? Why not have it > live side by side with the new one? (Again, I am not saying the old > algorithm was better-Chris' contribution definitely was valuable). I think > we will run into these issues often. > > Thoughts? If this has been discussed already, my apologies.
Well, at a high level, we tried to settle this at the very beginning. Have a look at items 3 and 4 in the "guiding principles" on the [math] main page :) That stuff comes from the original proposal for [math] and we have tried to stay more or less faithful to the principles laid out there. The integration, ode and solvers packages all try to do exactly what you are suggesting - when multiple algorithms, or even variations on an algorithm, exist and no single one can really do the job for all practical use cases, we welcome and incorporate implementations of multiple different ones. How we do that varies by package and this is one place where our "interface pain" has led to some trauma. Initially, we pretty much uniformly separated interfaces from implementation precisely for this reason. You can still see that in the distributions package and while it is a little ironic since we have been talking about collapsing the interfaces into the impls there, the setup now supports multiple different implementations of any of the distributions. I don't want to turn this into an abstract discussion of how best to support the strategy pattern in [math]. I think the "how to do it" part depends on the problem / package. What we have tried to do so far and I think we should keep trying to do is: 0) If there is a single best algorithm that really does work well in almost all cases, make that "the obvious thing" and try to make our implementation of it as robust as possible. If we provide only one implementation, try to make it the best one. I would say put SimpleRegression, or Variance, for example, in this category. 1) When multiple different standard algorithms exist, make sure our API supports adding alternatives - including user-supplied alternatives - in the least-confusing way we can think of. Welcome contributions of the alternatives and try to document as best we can how and when to use the different implementations. Try to stick to standard algorithms, with good external references, so we don't have to turn our javadoc and/or user guide into a numerical analysis textbook. 2) Whenever possible, try to encapsulate the part that varies for different implementations, so the API remains simple and the variable part can be "plugged in." Good examples of this are the RandomGenerators or the Solvers used by different classes. The general question is a good one; but the specific answer depends on the algorithms and classes involved. In the two cases that you mention (t distribution cdf and SVD), we have to balance API complexity and more code to support against practical value. SVD has been a big challenge for us, so I would say we should start at step 0) on that one; but I could easily be talked into supporting multiple impls given strong numerical arguments and multiple good, robust impls. For the t distribution, I am not so sure. A separate thread for that is probably best. My intuition there is that like the other distributions, we and our users are best off with a single impl that does whatever it needs to do in different ranges to provide robust estimates, possibly allowing configuration settings to trade performance for accuracy. Phil > > -Greg > --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org For additional commands, e-mail: dev-h...@commons.apache.org