Dear all,
I don't want to argue with anybody about words or about what bootstrap
is suitable for - I know too little for that.
All I need is help to get the *equation coefficients* optimized by
bootstrap - either by one of the functions or by simple median.
Please help,
--
Michal J. Figurski
HUP, Pathology & Laboratory Medicine
Xenobiotics Toxicokinetics Research Laboratory
3400 Spruce St. 7 Maloney
Philadelphia, PA 19104
tel. (215) 662-3413
Frank E Harrell Jr wrote:
Michal Figurski wrote:
Frank,
"How does bootstrap improve on that?"
I don't know, but I have an idea. Since the data in my set are just a
small sample of a big population, if I use my whole dataset to obtain
maximum likelihood estimates, these estimates may be best for this
dataset, but far from ideal for the whole population.

The bootstrap, being a resampling procedure from your sample, has the
same issues about the population as MLEs.
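
The point about resampling from the sample can be illustrated with a small simulation. This is only a sketch on made-up data: the population coefficients in `true_beta`, the single predictor `x` and the number of resamples `B` are assumptions chosen for illustration, not values from this thread. Because every bootstrap sample is drawn from the same 109 rows, the median of the bootstrap coefficients would be expected to sit close to the maximum likelihood fit on those rows rather than move toward the population values.

```r
set.seed(1)

# Hypothetical population model (assumed for illustration only)
n <- 109
true_beta <- c(-0.5, 1)
x <- rnorm(n)
y <- rbinom(n, 1, plogis(true_beta[1] + true_beta[2] * x))
d <- data.frame(x = x, y = y)

# Maximum likelihood estimates from the one sample we actually have
mle <- coef(glm(y ~ x, family = binomial, data = d))

# Bootstrap: resample rows of the SAME sample and refit each time
B <- 500
boot_coefs <- t(replicate(B, {
  idx <- sample(nrow(d), replace = TRUE)
  coef(glm(y ~ x, family = binomial, data = d[idx, ]))
}))
boot_median <- apply(boot_coefs, 2, median)

# Compare: the bootstrap medians are tied to the sample, not to the population
print(round(rbind(truth = true_beta, mle = mle, boot_median = boot_median), 3))
```
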

I used bootstrap to virtually increase the size of my dataset. It
should result in estimates closer to those from the population -
isn't that the purpose of bootstrap?

No.

When I use such median coefficients on another dataset (another sample
from the population), the predictions are better than with maximum
likelihood estimates. I have already tested that and it worked!

Then your testing procedure is probably not valid.

I am not a statistician and I don't have a feel for what "overfitting"
is, but it may be just another word for the same idea.
Nevertheless, I would still like to know how I can get the coefficients
for the model that gives the "nearly unbiased estimates". I greatly
appreciate your help.

More info in my book Regression Modeling Strategies.
Frank
--
Michal J. Figurski
Frank E Harrell Jr wrote:
Michal Figurski wrote:
Hello all,
I am trying to optimize my logistic regression model by using
bootstrap. I was previously using SAS for this kind of task, but I
am now switching to R.
My data frame consists of 5 columns and has 109 rows. Each row is a
single record composed of the following values: Subject_name,
numeric1, numeric2, numeric3 and outcome (yes or no). All three
numerics are used to predict outcome using LR.
In SAS I wrote a macro that split the dataset, ran LR on one half of
the data and made predictions on the second half. It then collected
the equation coefficients from each iteration of the bootstrap. Later
I just took the medians of these coefficients over all iterations and
used them as the optimal model - it really worked well!

Why not use maximum likelihood estimation, i.e., the coefficients
from the original fit? How does the bootstrap improve on that?

Now I want to do the same in R. I tried to use the 'validate' or
'calibrate' functions from package "Design", and I also experimented
with function 'sm.binomial.bootstrap' from package "sm". I also tried
the function 'boot' from package "boot", though without success
- in my case it randomly selected _columns_ from my data frame,
while I wanted it to select _rows_.

validate and calibrate in Design resample the rows.
Resampling is mainly used to get a nearly unbiased estimate of the
model performance, i.e., to correct for overfitting.
Frank Harrell

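
To make "correct for overfitting" concrete, here is a minimal base-R sketch of an optimism-corrected bootstrap for one performance measure, the area under the ROC curve. It is an illustration of the general idea only, not the Design implementation. The simulated data, the helper names `auc` and `fit_model`, and the choice of `B` are all assumptions. Note that the coefficients reported at the end are still those of the fit on the full data; the bootstrap is used only to adjust the performance estimate.

```r
set.seed(2)

# Simulated stand-in for the data described in the thread (assumed, not real)
n <- 109
d <- data.frame(numeric1 = rnorm(n), numeric2 = rnorm(n), numeric3 = rnorm(n))
d$outcome <- rbinom(n, 1, plogis(0.8 * d$numeric1 - 0.5 * d$numeric2))

# Hypothetical helper: AUC via the rank-sum (Mann-Whitney) formula
auc <- function(p, y) {
  r <- rank(p)
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Hypothetical helper: the model being validated
fit_model <- function(data) {
  glm(outcome ~ numeric1 + numeric2 + numeric3, family = binomial, data = data)
}

# Apparent performance: model fitted and evaluated on the same data
full_fit <- fit_model(d)
apparent <- auc(predict(full_fit, type = "response"), d$outcome)

# Optimism: how much better each bootstrap model looks on its own
# resample than on the original data
B <- 200
optimism <- replicate(B, {
  b <- d[sample(nrow(d), replace = TRUE), ]
  fit_b <- fit_model(b)
  auc(predict(fit_b, type = "response"), b$outcome) -
    auc(predict(fit_b, newdata = d, type = "response"), d$outcome)
})

corrected <- apparent - mean(optimism)

print(coef(full_fit))  # coefficients stay those of the original fit
print(c(apparent = apparent, optimism = mean(optimism), corrected = corrected))
```
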
Though the main point here is the optimized LR equation. I would
appreciate any help on how to extract the LR equation coefficients
from any of these bootstrap functions, in the same form as given by
'glm' or 'lrm'.
Many thanks in advance!
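
For the mechanical part of the question, the following base-R sketch shows one way to resample rows and collect coefficients in the same form that `coef()` returns for a 'glm' fit. The data are simulated to match the layout described above; `boot_once`, `form`, `coef_mat` and the value of `B` are hypothetical names and settings. One possible reason for ending up with columns instead of rows (a guess, since the original call is not shown) is that for a data frame `data[idx]` selects columns, whereas `data[idx, ]` selects rows.

```r
set.seed(3)

# Simulated data frame with the layout described in the thread (assumed values)
n <- 109
dat <- data.frame(
  Subject_name = paste0("S", seq_len(n)),
  numeric1 = rnorm(n),
  numeric2 = rnorm(n),
  numeric3 = rnorm(n)
)
p_yes <- plogis(dat$numeric1 - 0.5 * dat$numeric3)
dat$outcome <- factor(ifelse(runif(n) < p_yes, "yes", "no"))

form <- outcome ~ numeric1 + numeric2 + numeric3
orig_fit <- glm(form, family = binomial, data = dat)

# One bootstrap replicate: draw row indices with replacement and refit
boot_once <- function(data) {
  idx <- sample(nrow(data), replace = TRUE)
  coef(glm(form, family = binomial, data = data[idx, ]))  # comma => rows
}

B <- 300
coef_mat <- t(replicate(B, boot_once(dat)))  # one row per resample

print(coef(orig_fit))                  # maximum likelihood coefficients
print(apply(coef_mat, 2, median))      # medians across resamples
print(apply(coef_mat, 2, quantile, probs = c(0.025, 0.975)))  # percentile range
```

The columns of `coef_mat` carry the same names as `coef(orig_fit)`, so the two can be compared directly. As discussed earlier in the thread, the spread of the bootstrap coefficients says something about their variability, while the original fit remains the maximum likelihood estimate.
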
______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help