This is explained in the "Details" section of the help page for partialPlot.
Best
Andy
> -Original Message-
> From: R-help [mailto:r-help-boun...@r-project.org] On Behalf Of Jesús Para
> Fernández
> Sent: Tuesday, April 12, 2016 1:17 AM
> To: r-help@r-project.org
> Subject: [R] Random For
Hi Sonja,
How did you build the rpart tree (i.e., what settings did you use in
rpart.control)? Rpart by default will use cross validation to prune back the
tree, whereas RF doesn't need that. There are other more subtle differences as
well. If you want to compare single tree results, you rea
If you are using the code, that's not really using randomForest directly. I
don't understand the data structure you have (since you did not show anything)
so can't really tell you much. In any case, that warning came from
randomForest() when it is run in regression mode but the response has fe
You can try something like this:
http://pubs.acs.org/doi/abs/10.1021/ci050022a
Basically similar idea to what is done in random forests: permute predictor
variable one at a time and see how much that degrades prediction performance.
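For instance, a minimal sketch of that permute-and-score idea (all names here
are illustrative, and "fit" is any fitted regression model with a predict()
method -- this is not code from the paper above):

perm_importance <- function(fit, x, y, nrep = 10) {
  ## baseline error on the intact data
  base_mse <- mean((predict(fit, x) - y)^2)
  sapply(names(x), function(v) {
    mse <- replicate(nrep, {
      xp <- x
      xp[[v]] <- sample(xp[[v]])   ## permute one predictor
      mean((predict(fit, xp) - y)^2)
    })
    mean(mse) - base_mse           ## increase in MSE = importance
  })
}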
Cheers,
Andy
-Original Message-
From: r-help-boun...@r
Yes, that's part of the intention anyway. One can also use them to do
clustering.
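For example, one could cluster on 1 - proximity (a sketch, not from the
original thread):

library(randomForest)
set.seed(1)
rf <- randomForest(iris[1:4], iris$Species, proximity = TRUE)
d <- as.dist(1 - rf$proximity)        ## proximity -> dissimilarity
hc <- hclust(d, method = "average")
table(cutree(hc, k = 3), iris$Species)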
Best,
Andy
-Original Message-
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On
Behalf Of Massimo Bressan
Sent: Monday, December 02, 2013 6:34 AM
To: r-help@r-project.org
Subject
#2 can be done simply with predict(fmi, type="prob"). See the help page for
predict.randomForest().
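Something like the following, where "testset" is just a stand-in for your own
new data and fmi is the fitted object from the original post:

probs <- predict(fmi, newdata = testset, type = "prob")
head(probs)   ## one column of class probabilities per class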
Best,
Andy
-Original Message-
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On
Behalf Of arun
Sent: Tuesday, November 26, 2013 6:57 PM
To: R help
Subject: Re:
Classification trees use the Gini index, whereas regression trees use the sum
of squared errors. They are "hard-wired" into the C/Fortran code, so not
easily changeable.
Best,
Andy
-Original Message-
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On
Behalf Of
The difference is importance(..., scale=TRUE). See the help page for details.
If you extract the $importance component from a randomForest object, you do not
get the scaling.
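To see the difference on a standard example (note importance=TRUE is needed at
fit time for the permutation measure):

library(randomForest)
set.seed(1)
rf <- randomForest(Species ~ ., data = iris, importance = TRUE)
importance(rf, scale = TRUE)   ## scaled (divided by SD over trees)
rf$importance                  ## the raw, unscaled component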
Best,
Andy
-Original Message-
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On
Beha
Use KernSmooth (one of the recommended packages included in the R
distribution). E.g.,
> library(KernSmooth)
KernSmooth 2.23 loaded
Copyright M. P. Wand 1997-2009
> x <- seq(0, 1, length=201)
> y <- 4 * cos(2*pi*x) + rnorm(x)
> f <- locpoly(x, y, degree=0, kernel="epan", bandwidth=.1)
> plo
It needs to be done "by hand", in that partialPlot() does not handle more than
one variable at a time. You need to modify its code to do that (and be ready
to wait even longer, as it can be slow).
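A rough sketch of the brute-force approach (illustrative only, and slow --
n.grid^2 calls to predict(); "rf" is a fitted regression forest on the data
frame "dat"):

partial2 <- function(rf, dat, v1, v2, n.grid = 20) {
  g1 <- seq(min(dat[[v1]]), max(dat[[v1]]), length = n.grid)
  g2 <- seq(min(dat[[v2]]), max(dat[[v2]]), length = n.grid)
  z <- outer(g1, g2, Vectorize(function(a, b) {
    tmp <- dat
    tmp[[v1]] <- a
    tmp[[v2]] <- b
    mean(predict(rf, tmp))      ## average prediction over the data
  }))
  list(x = g1, y = g2, z = z)   ## ready for image(), contour(), persp()
}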
Andy
-Original Message-
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-proj
Without data to reproduce what you saw, we can only guess.
One possibility is due to tie-breaking. There are several places where ties
can occur and are broken at random, including at the prediction step. One
difference between the two ways of doing prediction is that when it's all done
withi
Try the following:
set.seed(100)
rf1 <- randomForest(Species ~ ., data=iris)
set.seed(100)
rf2 <- randomForest(iris[1:4], iris$Species)
object.size(rf1)
object.size(rf2)
str(rf1)
str(rf2)
You can try it on your own data. That should give you some hints about why the
formula interface should be
Not unless we have more information. Please read the Posting Guide to see how
to make it easier for people to answer your question.
Best,
Andy
-Original Message-
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On
Behalf Of Oritteropus
Sent: Thursday, November 2
How about taking the combination of the two? E.g., gamma = factor(paste(alpha,
beta1, sep=":")) and use gamma as the response.
Best,
Andy
-Original Message-
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On
Behalf Of Gyanendra Pokharel
Sent: Tuesday, October
1. Not sure what you want. What "details" are you looking for exactly? If
you call predict(trainset) without the newdata argument, you will get the
(out-of-bag) prediction of the training set, which is exactly the "predicted"
component of the RF object.
2. If you set type="votes" and norm.v
The "type=1" importance measure in RF compares the prediction error of each
tree on the OOB data with the prediction error of the same tree on the OOB data
with the values of one variable randomly shuffled. If the variable has no
predictive power, then the two should be very close, and there's
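That measure can be pulled out directly; importance=TRUE must be set when
fitting (a small illustration, not from the original thread):

library(randomForest)
data(airquality)
set.seed(131)
rf <- randomForest(Ozone ~ ., data = na.omit(airquality), importance = TRUE)
importance(rf, type = 1)   ## the permutation ("type = 1") measure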
Yes, you need to modify both the R and the underlying C code. It's in the
source package on CRAN (the .tar.gz file).
Andy
-Original Message-
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On
Behalf Of Josh Browning
Sent: Friday, June 01, 2012 10:48 AM
To: r
As long as you can remember that the summaries such as variable importance, OOB
predictions, and OOB error rates are not applicable, I think that should be
fine.
Andy
-Original Message-
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On
Behalf Of Nikita Desai
Hi Kelly,
The function has a limitation that it cannot handle any column in your "x" that
is a categorical variable with more than 32 categories. One possibility is to
see if you can "bin" some of the categories into one to get below 32 categories.
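A sketch of that binning (the "city" column here is just an illustration):

tab <- sort(table(x$city), decreasing = TRUE)
keep <- names(tab)[1:31]               ## the 31 most frequent categories
x$city <- factor(ifelse(x$city %in% keep, as.character(x$city), "other"))
nlevels(x$city)                        ## now at most 32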
Andy
-Original Message-
From: r-hel
That's not how RF works at all. The setting of mtry is irrelevant to this.
Andy
-Original Message-
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On
Behalf Of matt
Sent: Monday, May 14, 2012 10:22 AM
To: r-help@r-project.org
Subject: Re: [R] Random forests pre
It doesn't: You just get an error if there are NAs in the data; e.g.,
R> rf1 = randomForest(iris[1:4], iris[[5]])
R> predict(rf1, newdata=data.frame(Sepal.Length=1, Sepal.Width=2,
Petal.Length=3, Petal.Width=NA))
Error in predict.randomForest(rf1, newdata = data.frame(Sepal.Length = 1, :
mis
I don't think this is so hard to explain. If you evaluate AUC using either OOB
prediction or on a test set (or something like CV or bootstrap), that would be
what I expect for most data. When you add more variables (that are, say, less
informative) to a model, the model has to look harder to f
Note that the partialPlot() function also returns the x-y pairs being plotted,
so you can work from there if you wish. As to SD, my guess is you want some
sort of confidence interval or band around the curve? I do not know of any
theory to produce that, but that may well just be my ignorance.
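For example (plot=FALSE suppresses the plot and just returns the pairs; the
data set is only for illustration):

library(randomForest)
data(airquality)
set.seed(1)
rf <- randomForest(Ozone ~ ., data = na.omit(airquality))
pp <- partialPlot(rf, na.omit(airquality), x.var = "Temp", plot = FALSE)
plot(pp$x, pp$y, type = "l")   ## the same curve, now yours to annotate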
Without seeing your code, it's hard to say much more, but do avoid using
formula when you have large data.
Andy
-Original Message-
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On
Behalf Of Jason & Caroline Shaw
Sent: Friday, April 06, 2012 1:20 PM
To: jim ho
Please read the help page for the partialPlot() function and make sure you
learn about all its arguments (in particular, "which.class").
Andy
-Original Message-
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On
Behalf Of jmc
Sent: Wednesday, April 11, 2012 2:4
Alternatively, use only a subset to run loess(), either a random sample or
something like every k-th (sorted) data value, or the quantiles. It's
hard for me to imagine that that many data points are going to improve your
model much at all (unless you use a tiny span).
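E.g., something along these lines (purely illustrative numbers):

set.seed(1)
n <- 1e6
dat <- data.frame(x = runif(n))
dat$y <- sin(2 * pi * dat$x) + rnorm(n, sd = 0.3)
sub <- dat[sample(n, 5000), ]          ## fit on 5000 random rows
fit <- loess(y ~ x, data = sub)
pred <- predict(fit, newdata = dat)    ## predict back on everything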
Andy
From: r-help-b
Don't know how you searched, but perhaps this might help:
https://stat.ethz.ch/pipermail/r-help/2007-March/128064.html
> -Original Message-
> From: r-help-boun...@r-project.org
> [mailto:r-help-boun...@r-project.org] On Behalf Of Jenn Barrett
> Sent: Tuesday, April 03, 2012 1:23 AM
> To
> From: r-help-boun...@r-project.org
> [mailto:r-help-boun...@r-project.org] On Behalf Of Saruman
>
> I don't see how this answered the original question of the poster.
>
> He was quite clear: the value of the predictions coming out
> of RF do not
> match what comes out of the predict function u
Sam,
As you've probably seen, all the MDSplot() function does is feed 1 - proximity
to the cmdscale() function. Some suggestions and clarifications:
1. If all you want is the proximity matrix, you can run randomForest() with
keep.forest=FALSE to save memory. You will likely want to run somewhat
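To spell out the cmdscale() part, a sketch along the lines of what MDSplot()
does internally (iris is just for illustration):

library(randomForest)
set.seed(17)
rf <- randomForest(iris[1:4], iris$Species, proximity = TRUE,
                   keep.forest = FALSE)     ## forest not kept: saves memory
mds <- cmdscale(1 - rf$proximity, k = 2)
plot(mds, col = as.integer(iris$Species), pch = 19)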
I believe you are expecting the software to do something it never claimed to
be able to do. predict.locfit() does not have a "type" argument, nor can it
take the value "terms". When you specify two variables in the smooth, a
bivariate smooth is done, so you get one bivariate smooth function, not the sum
The Biometrics Research department at the Merck Research Laboratories has an
open position to be located in Rahway, New Jersey, USA:
This position will be responsible for imaging and bio-signal biomarkers
projects including analysis of preclinical, early clinical, and experimental
medicine imag
The way to represent categorical variables is with factors. See ?factor.
randomForest() will handle factors appropriately, as do most modeling functions
in R.
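E.g.:

d <- data.frame(x1 = rnorm(10),
                x2 = factor(rep(c("red", "green"), 5)),  ## categorical
                y = rnorm(10))
str(d$x2)   ## Factor w/ 2 levels: treated as categorical by randomForest()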
Andy
> -Original Message-
> From: r-help-boun...@r-project.org
> [mailto:r-help-boun...@r-project.org] On Behalf Of abhishek
>
Just use the reshape() function in base R:
df.long = reshape(df, varying=list(names(df)[4:7]), direction="long")
This also gives two extra columns ("time" and "id") that can be dropped.
Andy
> -Original Message-
> From: r-help-boun...@r-project.org
> [mailto:r-help-boun...@r-project.
That's why I said you need the book. The details are all in the book.
From: Michael [mailto:comtech@gmail.com]
Sent: Thursday, February 23, 2012 1:49 PM
To: Liaw, Andy
Cc: r-help
Subject: Re: [R] Good and modern Kernel Regression package in R with
__
From: Michael [mailto:comtech@gmail.com]
Sent: Thursday, February 23, 2012 10:06 AM
To: Liaw, Andy
Cc: Bert Gunter; r-help
Subject: Re: [R] Good and modern Kernel Regression package in R with
auto-bandwidth?
Thank you Andy!
I went thru KernSmooth package but I don't se
ok to get most
mileage out of it though.
Andy
From: Michael [mailto:comtech@gmail.com]
Sent: Thursday, February 23, 2012 12:25 AM
To: Liaw, Andy
Cc: Bert Gunter; r-help
Subject: Re: [R] Good and modern Kernel Regression package in R with
auto-bandwidth?
Bert's question aside (I was going to ask about laundry, but that's much harder
than taxes...), my understanding of the situation is that "optimal" is in the
eye of the beholder. There were at least two schools of thought on which is
the better way of automatically selecting bandwidth, using pl
name in X)
>
> Hi Andy,
>
> On Tuesday, January 31, 2012 08:44:13 AM Liaw, Andy wrote:
> > I'm not exactly sure if this is a problem with indexing by
> name; i.e., is
> > the following behavior by design? The problem is that
> names or dimnames
> > that ar
There's an alternative, but it may not be any more efficient in time or
memory...
You can run predict() on the training set once, setting nodes=TRUE. That will
give you an n by ntree matrix of which node of which tree the data point falls
in. For any new data, you would run predict() with node
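A sketch of that (using iris just for illustration):

library(randomForest)
set.seed(1)
rf <- randomForest(iris[1:4], iris$Species, ntree = 50)
pr <- predict(rf, iris[1:4], nodes = TRUE)
node.mat <- attr(pr, "nodes")   ## n by ntree matrix of terminal node IDs
dim(node.mat)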
You should be able to use the Rgui menu to install packages.
Andy
> -Original Message-
> From: r-help-boun...@r-project.org
> [mailto:r-help-boun...@r-project.org] On Behalf Of Niratha
> Sent: Wednesday, February 01, 2012 5:16 AM
> To: r-help@r-project.org
> Subject: [R] Random Forest P
I'm not exactly sure if this is a problem with indexing by name; i.e., is the
following behavior by design? The problem is that names or dimnames that are
empty seem to be treated differently, and one can't index by them:
R> junk = 1:3
R> names(junk) = c("a", "b", "")
R> junk
a b
1 2 3
R> j
The reason that it's not implemented is because of computational cost. Some
users had done it on their own using the same idea. It's just that it takes
too much memory for even moderately sized data. It can be done much more
efficiently in MART because computational shortcuts were used.
Be
Variable selection is part of the training process-- it chooses the model. By
definition, test data is used only for testing (evaluating the chosen model).
If you find a package or function that does variable selection on test data,
run from it!
Best,
Andy
> -Original Message-
> From: r-he
See the gss package on CRAN.
Andy
> -Original Message-
> From: r-help-boun...@r-project.org
> [mailto:r-help-boun...@r-project.org] On Behalf Of ali_protocol
> Sent: Monday, January 09, 2012 7:13 AM
> To: r-help@r-project.org
> Subject: [R] What is the function for "smoothing splines wi
You should see no differences beyond what you'd get by running RF a second time
with a different random number seed.
Best,
Andy
From: gianni lavaredo [mailto:gianni.lavar...@gmail.com]
Sent: Monday, December 05, 2011 2:19 PM
To: Liaw, Andy
Cc: r-h
Tree-based models (such as RF) are invariant to monotonic transformations of the
predictor (x) variables, because they only use the ranks of the variables, not
their actual values. More specifically, they look for splits that are at the
mid-points of unique values. Thus the resulting trees are
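One way to convince yourself (a sketch: with the same seed, the two runs draw
the same bootstrap samples and variable subsets, so the predictions ought to
agree):

library(randomForest)
set.seed(42)
rf.raw <- randomForest(Species ~ Petal.Length + Petal.Width, data = iris)
set.seed(42)
rf.log <- randomForest(Species ~ log(Petal.Length) + log(Petal.Width),
                       data = iris)
all(predict(rf.raw) == predict(rf.log))   ## should be TRUE: ranks unchanged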
The first version of the package was created by re-writing the main program in
the original Fortran as C, and calls other Fortran subroutines that were mostly
untouched, so dynamic memory allocation can be done. Later versions have most
of the Fortran code translated/re-written in C. Currently
Not only that, but in the same help page, same "Value" section, it says:
predicted: the predicted values of the input data based on out-of-bag
samples
so people really should read the help pages instead of speculating...
If the error rates were not based on OOB samples, they would drop to (
Gianni,
You should not "tune" ntree in cross-validation or other validation methods,
and especially should not be using OOB MSE to do so.
1. At ntree=1, you are using only about 36% of the data to assess the
performance of a single random tree. This number can vary wildly. I'd say
don't both
Hi Dongli,
Questions about usage of specific contributed packages are best directed toward
the package maintainer/author first, as they are likely the best sources of
information, and they don't necessarily subscribe to or keep up with the daily
deluge of R-help messages.
(In this particular c
You are not giving anyone much to go on. Please read the posting guide and see
how to ask your question in a way that's easier for others to answer. At the
_very_ least, show what commands you used, what your data looks like, etc.
Andy
> -Original Message-
> From: r-help-boun...@r-pr
The current "classwt" option in the randomForest package has been there since
the beginning, and is different from how the official Fortran code (version 4
and later) implements class weights. It simply accounts for the class weights
in the Gini index calculation when splitting nodes, exactly as
It looks like you are building a regression model. With such a large number of
rows, you should try to limit the size of the trees by setting nodesize to
something larger than the default (5). The issue, I suspect, is the fact that
the size of the largest possible tree has about 2*nodesize nod
You really need to follow the suggestions in the posting guide to get the best
help from this list.
Which versions of randomForest are you using in S-PLUS and R? Which version of
R are you using? When you restore the object into R, what does str(object)
say? Have you also tried dump()/sour
See if the following is close to what you're looking for. If not, please give
more detail on what you want to do.
data(airquality)
airquality <- na.omit(airquality)
set.seed(131)
ozone.rf <- randomForest(Ozone ~ ., airquality, importance=TRUE)
imp <- importance(ozone.rf) # get the importance measures
Has anyone suggested mosaic displays? That's the closest I can think of as a
"square pie chart"...
> -Original Message-
> From: r-help-boun...@r-project.org
> [mailto:r-help-boun...@r-project.org] On Behalf Of Naomi Robbins
> Sent: Sunday, July 24, 2011 7:09 AM
> To: Thomas Levine
> Cc
From: Prof Brian Ripley
>
> Hmm, load() does have an 'envir' argument. So you could simply use
> that and with() (which is pretty much what attach() does internally).
>
> If people really wanted a lazy approach, with() could be extended to
> allow file names (as attach does).
I'm not sure if
I don't have access to that article, but just reading the abstract, it
should be quite easy to do by writing a wrapper function that calls
randomForest(). I've done so with random projections before. One
limitation to methods like these is that they only apply to all-numeric
data.
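A sketch of such a wrapper with a random projection (illustrative only, not
the method from the article):

library(randomForest)
rp_forest <- function(x, y, k = 5, ...) {
  x <- as.matrix(x)    ## all-numeric data only
  proj <- matrix(rnorm(ncol(x) * k), ncol(x), k)
  xp <- x %*% proj     ## project onto k random directions
  colnames(xp) <- paste0("rp", 1:k)
  ## a real wrapper would also keep "proj" around for prediction time
  randomForest(xp, y, ...)
}
set.seed(3)
rf <- rp_forest(iris[1:4], iris$Species, k = 3)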
Andy
> -
Is something like this what you're looking for?
R> library(nor1mix)
R> nmix2 <- norMix(c(2, 3), sig2=c(25, 4), w=c(.2, .8))
R> dnorMix(1, nmix2) - dnorm(1, 2, 5)
[1] 0.03422146
Andy
> -Original Message-
> From: r-help-boun...@r-project.org
> [mailto:r-help-boun...@r-project.org] On Beha
Just to add my ever depreciating $0.02 USD:
Keep in mind that the significance testing paradigm puts a constraint on the
false positive rate, and lets the false negative rate float. What you should
consider is whether that makes sense in your situation. All too often
this is not carefully considered, and
As far as I can tell, Uwe is not even fitting a model, but instead just
solving a nonlinear equation, so I don't know why he wants an R^2. I
don't see a statistical model here, so I don't know why one would want a
statistical measure.
Andy
> -Original Message-
> From: r-help-boun...@r-pr
You need to use "==" instead of "=" for testing equality. While you're at it,
you should check for positive values, not just screen out 0s. This works
for me:
R> mydata = data.frame(x=0:10, y=runif(11))
R> fm = lm(y ~ log(x), mydata, subset=x>0)
Andy
> -Original Message-
> From:
Exactly as Max said. See the rfcv() function in the latest version of
randomForest, as well as the reference in the help page for that function.
OOB estimate is as accurate as CV estimate _if_ you run straight RF. Most
other methods do not have this "feature". However, if you start adding ste
The locfit() function in the locfit package is a slightly more modern
implementation of loess, and is much more flexible in that it has a lot of
options to tweak. One
such option is the kernel. There are seven to choose from.
Andy
From: wisdomtooth
>
> >From what I understand, loess in R uses the stand
Check out caret::varImp.rpart(). It's described in the original CART
book.
Andy
From: Tal Galili
>
> Hello all,
>
> When building a CART model (specifically classification tree)
> using rpart,
> it is sometimes interesting to know what is the importance of
> the various
> variables introduc
LP
> 35 Gatehouse Drive
> Waltham, MA 02451
> USA
> 781-839-4304
> ryszard.czermin...@astrazeneca.com
>
> RE: [R] randomForest: too many element specified?
> Liaw, Andy
> Mon, 17 Jan 2005 05:56:28 -0800
> > From: luk
> >
> > When I run randomForest wi
I was communicating with Kevin off-list.
The problem seems to be run time, not install time. news() calls
tools:::.build_news_db(), and the 2nd line of that function is:
nfile <- file.path(dir, "inst", "NEWS.Rd")
and that's the problem: an installed package shouldn't have an inst/
subdirectory.
Note that that isn't exactly what I recommended. If you look at the
example in the help page for combine(), you'll see that it is combining
RF objects trained on the same data; i.e., instead of having one RF with
500 trees, you can combine five RFs trained on the same data with 100
trees each into
If you have multiple cores, one "poor man's solution" is to run separate
forests in different R sessions, save the RF objects, load them into the
same session and combine() them. You can do this less clumsily if you
use things like Rmpi or other distributed computing packages.
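In a single session the mechanics look like this (illustrative):

library(randomForest)
set.seed(1); rf1 <- randomForest(Species ~ ., iris, ntree = 100)
set.seed(2); rf2 <- randomForest(Species ~ ., iris, ntree = 100)
rf.all <- combine(rf1, rf2)   ## one forest with 200 trees
rf.all$ntree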
Another considerati
combine() is meant to be used on randomForest objects that were built
from identical training data.
Andy
> -Original Message-
> From: r-help-boun...@r-project.org
> [mailto:r-help-boun...@r-project.org] On Behalf Of Dennis Duro
> Sent: Friday, December 10, 2010 11:59 PM
> To: r-help@r-pr
The order in the output corresponds to the order of the input. I will
patch the code so that it grabs the row names of the input (if they exist).
If you specify type="prob", it already labels the rows by the input row
names.
> -Original Message-
> From: r-help-boun...@r-project.org
> [mailto:
data you want to predict, not the other way around.
Andy
> -Original Message-
> From: Deschamps, Benjamin [mailto:benjamin.descha...@agr.gc.ca]
> Sent: Tuesday, November 16, 2010 11:16 AM
> To: r-help@r-project.org
> Cc: Liaw, Andy
> Subject: RE: [R] randomForest pa
Please show us the code you used to run randomForest, the output, as
well as what you get with other algorithms (on the same random subset
for comparison). I have yet to see a dataset where randomForest does
_far_ worse than other methods.
Andy
> -Original Message-
> From: r-help-boun..
Job: Scientific programmer at Merck, Biostatistics, Rahway, NJ, USA
[Job Description]
This position works closely with statisticians to process and analyze
ultrasound, MRI, and radiotelemetry longitudinal studies using a series
of programs developed in R and Mathworks/Matlab. This position provid
The caret package has answers to all your questions.
> -Original Message-
> From: r-help-boun...@r-project.org
> [mailto:r-help-boun...@r-project.org] On Behalf Of Neeti
> Sent: Tuesday, October 26, 2010 10:42 AM
> To: r-help@r-project.org
> Subject: [R] to determine the variable importa
ing).
> >>
> >> For example, k nearest neighbors are not known to over
> fit, but a 1nn
> >> model will always perfectly predict the training data.
> >>
> >> Max
> >>
> >> On Oct 23, 2010, at 9:05 AM, "Liaw,
> Andy" wro
What Breiman meant is that as the model gets more complex (i.e., as the
number of trees tends to infinity) the generalization error (test set
error) does not increase. This does not hold for boosting, for example;
i.e., you can't "boost forever", which necessitates finding the
optimal number
Let me expand on what Max showed.
For the most part, performance on the training set is meaningless. (That's
the case for most algorithms, but especially so for RF.) In the default
(and recommended) setting, the trees are grown to the maximum size,
which means that quite likely there's only one data
From: Michael Lindgren
>
> Greetings R Users!
>
> I am posting to inquire about the proximity matrix in the randomForest
> R-package. I am having difficulty pushing very large data through the
> algorithm and it appears to hang on the building of the prox
> matrix. I have
> read on Dr. Breiman
The plot titles aren't pretty, but the following works for me:
R> library(randomForest)
randomForest 4.5-37
Type rfNews() to see new features/changes/bug fixes.
R> set.seed(1004)
R> iris.rf <- randomForest(iris[-5], iris[[5]], ntree=1001)
R> par(mfrow=c(2,2))
R> for (i in 1:4) partialPlot(iris.rf,
In a partial dependence plot, only the relative scale, not absolute
scale, of the y-axis is meaningful. I.e., you can compare the range of
the curves between partial dependence plots of two different variables,
but not the actual numbers on the axis. The range is compressed
compared to the origin
One possibility:
R> f = function(x, f) eval(as.call(list(as.name(f), x)))
R> f(1:10, "mean")
[1] 5.5
R> f(1:10, "max")
[1] 10
Andy
From: Jonathan Greenberg
> R-helpers:
>
> If I want to pass a character name of a function TO a
> function, and then
> have that function executed, how would I do
> From: Vijayan Padmanabhan
>
> Dear R Group
> I had an observation that in some cases, when I use the
> randomForest model
> to create partialPlot in R using the package "randomForest"
> the y-axis displays values that are more than -1!
> It is a classification problem that i was trying to addr
> From: jlu...@ria.buffalo.edu
>
> Clearly inferior treatments are unethical.
The Big Question is: What constitutes "clearly"? Who decides, and how? I'm
sure there are plenty of people who don't
understand much Statistics and are perfectly willing to say the results
on the tw
For Python, check out the project "orange":
http://www.ailab.si/orange/doc/catalog/Classify/ClassificationTree.htm
Not sure about C++, but OpenDT is in C:
http://opendt.sourceforge.net/
Looks like OpenCV has both Python and C++ interfaces (didn't see a Python interface
to decision tree, though):
htt
You're not giving us much to go on, so the info I can give is
correspondingly vague.
I take it you are using RF in "unsupervised" mode. What RF does in this
case is simply generate a second part of the data that have the same
marginal distribution as the data you have, but the variables are
independent.
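In code, unsupervised mode is just a matter of omitting y (a sketch):

library(randomForest)
set.seed(7)
urf <- randomForest(iris[1:4], proximity = TRUE)   ## no y: unsupervised
d <- as.dist(1 - urf$proximity)   ## dissimilarity for clustering or MDS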
Job description: Computational statistician/biometrician
The Biometrics Research Department at Merck Research Laboratories, Merck
& Co., Inc. in Rahway, NJ, is seeking a highly motivated
statistician/data analyst to work in its basic research, drug discovery,
preclinical and early clinical develo
From: Philipp Pagel
>
> In a current project, I am fitting loess models to subsets of data in
> order to use the loess predicitons for normalization (similar to what
> is done in many microarray analyses). While working on this I ran into
> a problem when I tried to predict from the loess models a
From: Stephen Liu
>
> Hi JesperHybel,
>
> Thanks for your advice.
>
> >If you're trying to follow the youtube video you have a
> typing mistake here:
>
> >InsectSprays.aov <-(test01$count ~ test01$spray)
>
> >I think this should be:
>
> >InsectSprays.aov <-aov(test01$count ~ test01$spray)
>
From: Stephen Liu
>
> Hi folks,
>
> R on Ubuntu 10.04 64 bit.
>
> Performed following steps on R:-
>
> ### to access to the object
> > data(InsectSprays)
>
> ### create a .csv file
> > write.csv(InsectSprays, "InsectSpraysCopy.csv")
>
>
> On another terminal
> $ sudo updatedb
> $ locate Inse
From: Pierre Dubath
>
> Hello,
>
> I am using the R randomForest package to classify variable
> stars. I have
> a training set of 1755 stars described by (too) many
> variables. Some of
> these variables are highly correlated.
>
> I believe that I understand how randomForest works and how
>
Seems to me it may be worth stating what may be elementary to some on this list:
- If all relevant variables are included in the model and the "true model" is
indeed linear, then all least squares estimated coefficients are unbiased. [
David Ruppert once said about the three kinds of lies: Lie
If the collinearity you're seeing arose from the addition of a product
(interaction) term, I do not think penalization is the best answer.
What is the goal of your analysis? If it's prediction, then I wouldn't
worry about this type of collinearity. If you're interested in
inference, I'd try some
As a matter of fact, I would say both Bert and I encounter "designed
experiments" a lot more than "observational studies", yet we speak from
experience that those things that Bert mentioned happen on a daily
basis. When you talk to experimenters, ask your questions carefully and
you'll see these t
There's a bug in the code. If you add row names to the X matrix before
you call randomForest(), you'd get:
R> summary(outlier(mdl.rf))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
-1.0580 -0.5957      0.  0.6406  1.2650  9.5200
I'll fix this in the next release. Thanks for reporting.
Best,
I'll incorporate some of these ideas into the next release. Thanks!
Best,
Andy
-Original Message-
From: h.wick...@gmail.com [mailto:h.wick...@gmail.com] On Behalf Of Hadley
Wickham
Sent: Thursday, July 01, 2010 8:08 PM
To: Mike Williamson
Cc: Liaw, Andy; r-help
Subject: Re: [R] a
roughfix(x))
   user  system elapsed
   8.44    0.39    8.85
R 2.11.1, randomForest 4.5-35, Windows XP (32-bit), Thinkpad T61 with
2GB ram.
Andy
From: Mike Williamson [mailto:this.is@gmail.com]
Sent: Thursday, July 01, 2010 12:48 PM
To: Liaw, Andy
Cc: r-h
You have not shown any code, so I can only guess at exactly how you are
using na.roughfix().
If you are doing something like:
randomForest(y ~ ., mybigdata, na.action=na.roughfix, ...)
I would not be surprised if it's taking very long on large datasets.
Most likely it's caused by the formula interface
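If so, one alternative is to impute once up front and use the x/y interface
(assuming, for illustration, the response column is named "y"):

x.fixed <- na.roughfix(mybigdata[, setdiff(names(mybigdata), "y")])
fit <- randomForest(x.fixed, mybigdata$y)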
cobler_squad needs more basic help than doing lda. The data input just
doesn't make sense.
If vowel_feature is a data frame, then G <- vowel_feature[15] creates
another data frame containing the 15th variable in vowel_feature, so "G"
is the name of a data frame, not a variable in a data frame.