This sort of problem can often be approached from a Bayesian point of view with a result that is a bit more intuitive.
The basic idea is that the data are measurements that come from some parameterized process. The parameters themselves are treated as sampled from some very non-specific distribution. The question then is: given what we have observed, what can we conclude about how probable different parameter values are? This leads to a very natural definition of confidence bounds and estimation.

The entire approach is anathema to some, but it makes lots of intuitive sense vis-a-vis how normal humans use probability as a concept, and it has very deep mathematical roots.

The philosophical problem can be illustrated easily. If I flip a coin and hold it in my closed hand, you and I would both declare the probability of heads to be 0.5, even though the physics of the situation make it clear that the coin has a single state that simply happens to be unknown to us. If I peek in my hand and we estimate the probabilities again, you will still say 0.5 and I will say 0 or 1, but definitely will not say 0.5. The only change has been that I have gained information, and thus probability as we have been using it is clearly a subjective concept. Further, it is admissible to use a probability distribution to describe a physical process that actually only has a single value.

This can be extended to some more complex measurement of a physical state where I cannot as easily open my hand. In such a system, each measurement that we make decreases our uncertainty about the unknown state, but does not necessarily eliminate that uncertainty. Treating that unknown state as having a distribution makes no statement about whether the state has a single value. Instead, it merely allows us to quantify our own state of ignorance or, more hopefully, our knowledge.
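To make the "each measurement decreases our uncertainty" point concrete, here is a minimal sketch in Python. It assumes a single fixed unknown state, Gaussian measurement noise of known variance, and a deliberately vague Gaussian prior. The function name `update`, the prior, the noise variance, and the measurement values are all made up for illustration and are not taken from anything in this thread.

```python
def update(mean, var, y, noise_var):
    """One conjugate Gaussian update of our belief about a fixed unknown state.

    (mean, var) describe our current belief; y is a new measurement with
    known noise variance noise_var. Returns the updated (mean, var).
    """
    precision = 1.0 / var + 1.0 / noise_var      # precisions add
    new_var = 1.0 / precision
    new_mean = new_var * (mean / var + y / noise_var)
    return new_mean, new_var


# Hypothetical vague prior and made-up measurements of one fixed state.
mean, var = 0.0, 100.0
measurements = [4.8, 5.3, 5.1, 4.9]

history = []
for y in measurements:
    mean, var = update(mean, var, y, noise_var=0.25)
    history.append((mean, var))
    print(f"after y={y}: mean={mean:.3f}, variance={var:.4f}")
```

The variance shrinks with every measurement but never reaches zero. The state itself never changed; only our knowledge of it did.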
One additional important point: regardless of whether we want to admit a definition of probability as a measure of subjective knowledge, it can be shown that, under the very weak constraint of exchangeability, we can behave *as if* this were true and get optimal estimates that can be framed in terms acceptable to frequentists who do not accept probability as subjective.

Operationally, this leaves us with the question of how to implement this. Whether the implementation involves sampling or not has no bearing on whether it is correct. If sampling is computationally convenient for providing numerical estimates, then so be it. Likewise, if sampling is convenient for testing an approach to see how well the resulting estimates conform to something that we know to be true, then sampling is a great thing. These two kinds of sampling are separate questions from each other, and separate again from how various kinds of estimates are computed and what they mean.

So that is where I come from. Now to the problem at hand.

The problem of least squares fitting can be described as estimating the parameters of a data generation process given observations. The data generation process in question has a linear relationship between the predictor (independent) variables and the target (dependent) variable. In addition to this linear relationship, there is additive Gaussian noise of unknown magnitude that perturbs the ideal value of the target variable to give the observed value. Generally, we have few preconceived notions about either the linear process or the noise process, but in some cases it is useful to introduce domain knowledge here as a form of regularization.

To relate this formulation to commonly used terminology, the accuracy of our estimation of the linear process is referred to as "standard error" and the magnitude of the noise process is referred to as "standard deviation".
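As a rough illustration of that terminology, the sketch below fits a straight line by ordinary least squares and reports both quantities: the estimated noise magnitude `s` and the standard errors of the intercept and slope. It uses the usual textbook expressions for simple linear regression with no regularization. The function name `fit_line` and the data are hypothetical, and this is plain Python rather than anything from Commons Math.

```python
from math import sqrt


def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x.

    Returns the estimates, the noise estimate s, and the standard errors
    of the intercept and slope (textbook simple-regression formulas).
    """
    n = len(xs)
    if n < 3:
        raise ValueError("need at least 3 points to estimate the noise")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = ss_xy / ss_xx
    intercept = y_bar - slope * x_bar
    # Residual sum of squares, divided by n - 2 degrees of freedom.
    rss = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    s = sqrt(rss / (n - 2))
    return {
        "intercept": intercept,
        "slope": slope,
        "s": s,  # estimated magnitude of the noise process
        "se_intercept": s * sqrt(1.0 / n + x_bar ** 2 / ss_xx),
        "se_slope": s / sqrt(ss_xx),
    }


# Made-up observations, roughly y = 1 + 2x plus noise.
fit = fit_line([0.0, 1.0, 2.0, 3.0], [1.1, 2.9, 5.2, 6.8])
for name, value in fit.items():
    print(f"{name}: {value:.4f}")
```

Note that `s` describes the scatter of the observations around the line, while the two standard errors describe how well the line itself is pinned down. With more points, the standard errors shrink but `s` does not.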
The accuracy of our estimate is nicely determined by the width of the posterior estimate of the linear process, and the magnitude of the noise process is well described by the mean of that parameter over the posterior distribution. For Gaussian noise processes, the values in formulae 34 and 35 are useful estimates of the former in the absence of regularization.

For multivariate problems, it can be very dangerous to estimate the covariance matrix, or its inverse, by the maximum likelihood estimate, since you have an excessive chance of catastrophically bad estimates. There is an extensive literature on Bayesian approaches to this problem.

I hope that this description doesn't rub folks the wrong way for being too elementary. I thought it might help to get basic terms in the open, since it sounds like there are unstated assumptions in the current discussion.

On Sun, May 6, 2012 at 6:15 AM, Dimitri Pourbaix <pourb...@astro.ulb.ac.be> wrote:

> Sebastien,
>
>> Hi Dimitri,
>> I'm obviously missing something in my literature review. I did a new
>> MC simulation, with a much smaller number of observation points
>> (namely 3, to fit a straight line!!!). It turns out that the formula
>> you are advocating for is the best estimate of the standard deviation
>> of the parameters. Could you please explain why this formula differs
>> from formulas (34) and (35) in
>> http://mathworld.wolfram.com/LeastSquaresFitting.html
>> ?
>
> First thing worth noting is that Wolfram is wise enough to call 34 and 35
> standard error ... and not standard deviation!
>
> As Gilles and you have shown with your MC simulations, the standard
> deviation (sigma_i=sqrt(cov[i][i])) approximates by how much the fitted
> parameter can vary when several sets of 'observations' are sampled with
> the same error distribution. I wrote 'approximates' because the true
> standard deviation is not accessible; instead, the covariance matrix is
> approximated by the inverse of the Fisher information matrix, which is
> directly related to the Hessian matrix. The relation between the Fisher
> information and the variance of the parameter is known as the
> Cramér-Rao bound.
>
> In the case of the standard error, the sample of observations is fixed
> and one wonders by how much one can change the parameters without
> changing the resulting normalized chi square too much. That is the
> role of s (eq. 32 on Wolfram). It should be noted that nowhere on
> that page is there any notion of error on the observations: the data
> are what they are and no alternative sampling should be considered.
>
> Please have a look at
>
> http://en.wikipedia.org/wiki/Standard_deviation
> http://en.wikipedia.org/wiki/Standard_error
>
> for further details, especially the last section of the Standard_error
> page, as it compares std. error and deviation.
>
> Regards,
> Dim.
> ----------------------------------------------------------------------------
> Dimitri Pourbaix                         *      Don't worry, be happy
> Institut d'Astronomie et d'Astrophysique *         and CARPE DIEM.
> CP 226, office 2.N4.211, building NO     *
> Universite Libre de Bruxelles            *      Tel : +32-2-650.35.71
> Boulevard du Triomphe                    *      Fax : +32-2-650.42.26
> B-1050 Bruxelles                         *      NAC: HBZSC RG2Z6
> http://sb9.astro.ulb.ac.be/~pourbaix     *      mailto: pourb...@astro.ulb.ac.be