Alberto,

Moshe was actually on the right track.

Your model does not satisfy the assumptions of the usual Gauss-Markov setup that guarantee unbiasedness of the LS solution.

In particular, unbiasedness of the LS solution requires the errors to have zero expectation conditional on the design columns, not just to be orthogonal to them on average.

But in your case that condition does not hold.
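Here is a quick sketch of why, in my own notation (not from your code), with X = cbind(1, x) and e the error vector:

  beta.hat = solve(t(X) %*% X) %*% t(X) %*% y
           = beta + solve(t(X) %*% X) %*% t(X) %*% e

so E(beta.hat) = beta exactly when the second term has zero expectation; E(e | X) = 0 guarantees that, but E(t(X) %*% e) = 0 alone does not once X is random.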

This is easily seen in the case n == 2, m == 0, c == 0.

Take eps <- rnorm(2)

Then x <- c(0, eps[1]) and y <- c(eps[1], eps[2]).

With m == 0 and c == 0 the model reduces to y = error, so the observed y is exactly the error vector (and its expectation is evidently 0).

With basic linear algebra, you can verify that, conditional on x, the vector rbind(1, x) %*% y has expectation c(eps[1], 0) rather than zero, and that the regression coefficient for 'x', which here is just the slope of the line through the two points, (y[2] - y[1]) / (x[2] - x[1]) = eps[2]/eps[1] - 1, is centered well below the true m = 0 (its median is -1).
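If you want to see it by simulation, here is a minimal sketch of that two-point case (the object name slope2 is mine, purely for illustration):

  set.seed(1)   # any seed, just for reproducibility
  slope2 <- replicate(10000, {
    eps <- rnorm(2)
    x <- c(0, eps[1])
    y <- c(eps[1], eps[2])   # with m = 0, c = 0, y is exactly the error vector
    coef(lm(y ~ x))["x"]     # the line through two points: (y[2] - y[1]) / (x[2] - x[1])
  })
  median(slope2)             # about -1, nowhere near the true m = 0
  mean(slope2 < 0)           # about 0.75

The median is the honest summary here, since the mean of eps[2]/eps[1] - 1 does not exist for normal errors.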

And just for fun:

The correlation between x and y when n == 2 is either +1 or -1, and -1 is exactly 3 times as likely as +1. You can deduce this from a simple graphical argument (or an equivalent argument invoking symmetry in the bivariate distribution of eps): the correlation is the sign of eps[1] * (eps[2] - eps[1]), and the two wedges of the (eps[1], eps[2]) plane where that sign is positive cover only a quarter of it.
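A quick check of that claim by simulation, along the same lines (sgn is again just an illustrative name):

  set.seed(2)
  sgn <- replicate(10000, {
    eps <- rnorm(2)
    sign(cor(c(0, eps[1]), c(eps[1], eps[2])))   # with two points the correlation is +1 or -1
  })
  table(sgn) / length(sgn)   # roughly 3/4 at -1 and 1/4 at +1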

HTH,

Chuck


On Mon, 18 Aug 2008, Moshe Olshansky wrote:

Hi Alberto,

Please disregard my previous note - I probably had a black-out!!!


--- On Tue, 19/8/08, Alberto Monteiro <[EMAIL PROTECTED]> wrote:

From: Alberto Monteiro <[EMAIL PROTECTED]>
Subject: [R] A doubt about "lm" and the meaning of its summary
To: r-help@r-project.org
Received: Tuesday, 19 August, 2008, 4:31 AM
I have a conceptual problem (sort of). Maybe there's a simple solution, maybe not.

First, let me explain the test case that goes ok.

Let x be a (fixed) n-dimensional vector. I simulate a lot of linear models, y = m x + c + error, and then run a regression on each. As expected, the estimated slopes (let's call them m.bar), centered at m and scaled by their standard errors, follow a Student's t distribution.

More precisely (test case, it takes a few minutes to run):
#
# start with fixed numbers
#
  m <- sample(c(-1, -0.1, 0.1, 1), 1)
  c <- sample(c(-1, -0.1, 0, 0.1, 1), 1)
  sigma <- sample(c(0.1, 0.2, 0.5, 1), 1)
  n <- sample(c(4,5,10,1000), 1)
  x <- 1:n # it's possible to use other x
  NN <- sample(c(3000, 4000, 5000), 1)
  m.bar <- m.std.error <- numeric(NN)  # these vectors will hold the simulation output

#
# Now let's repeat NN times:
# simulate y
# regress y ~ x
# store m.bar and its error
#
  for (i in 1:NN) {
    y <- m * x + c + sigma * rnorm(n)
    reg <- lm(y ~ x)
    m.bar[i] <- summary(reg)$coefficients['x', 'Estimate']
    m.std.error[i] <- summary(reg)$coefficients['x', 'Std. Error']
  }
#
# Finally, let's analyse it
# The distribution of (m.bar - m) / m.std.error is
# a Student's t with n - 2 degrees of freedom.
# Is it? Let's test!
#
  print(ks.test((m.bar - m) / m.std.error, rt(NN, n - 2)))

Now, this goes ok, with ks.test returning p-values that are uniformly distributed in the (0, 1) interval.

Next, I did a slightly different model. This is a reversion model, where the simulated random variable lp follows the equation:
lp[t + 1] <- (1 + m) * lp[t] + c + error

I tried to use the same method as above, defining x = lp[1:n] and y = lp[2:(n+1)] - lp[1:n], so that the pairs satisfy the equation
y <- m * x + c + error

And now it breaks. Test case (it takes some minutes to run):

#
# start with fixed numbers
#
  m <- runif(1, -1, 0)  # m must be something in the (-1, 0) interval
  c <- sample(c(-1, -0.1, 0, 0.1, 1), 1)
  sigma <- sample(c(0.1, 0.2, 0.5, 1), 1)
  n <- sample(c(4,5,10,1000), 1)
  NN <- sample(c(3000, 4000, 5000), 1)
  m.bar <- m.std.error <- numeric(NN)  # these vectors will hold the simulation output

#
# Now let's repeat NN times:
# simulate y
# regress y ~ x
# store m.bar and its error
#
  for (i in 1:NN) {
    lp <- numeric(n + 1)
    lp[1] <- 0
    for (j in 1:n) {
      lp[j + 1] <- (1 + m) * lp[j] + c + sigma * rnorm(1)
    }
    x <- lp[1:n]
    y <- lp[-1] - x
    reg <- lm(y ~ x)
    m.bar[i] <- summary(reg)$coefficients['x', 'Estimate']
    m.std.error[i] <- summary(reg)$coefficients['x', 'Std. Error']
  }
#
# Finally, let's analyse it
# The distribution of (m.bar - m) / m.std.error should be
# a Student's t with n - 2 degrees of freedom.
# Is it? Let's test!
#
  print(ks.test((m.bar - m) / m.std.error, rt(NN, n - 2)))

... and now it's not.

What's wrong with this? I suspect that the model y ~ x only gives meaningful values when x is deterministic, and in this case x is stochastic. Is there any correct way to estimate this model that gives meaningful outputs?

Alberto Monteiro



Charles C. Berry                            (858) 534-2098
                                            Dept of Family/Preventive Medicine
E mailto:[EMAIL PROTECTED]                  UC San Diego
http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
