Re: [R] Random Forest classification

2016-04-18 Thread Liaw, Andy
This is explained in the "Details" section of the help page for partialPlot. Best Andy > -Original Message- > From: R-help [mailto:r-help-boun...@r-project.org] On Behalf Of Jesús Para > Fernández > Sent: Tuesday, April 12, 2016 1:17 AM > To: r-help@r-project.org > Subject: [R] Random For

Re: [R] rpart and randomforest results

2014-04-07 Thread Liaw, Andy
Hi Sonja, How did you build the rpart tree (i.e., what settings did you use in rpart.control)? Rpart by default will use cross validation to prune back the tree, whereas RF doesn't need that. There are other more subtle differences as well. If you want to compare single tree results, you rea

Re: [R] randomForest warning: The response has five or fewer unique values. Are you sure you want to do regression?

2014-03-24 Thread Liaw, Andy
If you are using the code, that's not really using randomForest directly. I don't understand the data structure you have (since you did not show anything) so can't really tell you much. In any case, that warning came from randomForest() when it is run in regression mode but the response has fe

Re: [R] Variable importance - ANN

2013-12-04 Thread Liaw, Andy
You can try something like this: http://pubs.acs.org/doi/abs/10.1021/ci050022a Basically similar idea to what is done in random forests: permute predictor variable one at a time and see how much that degrades prediction performance. Cheers, Andy -Original Message- From: r-help-boun...@r

Re: [R] interpretation of MDS plot in random forest

2013-12-02 Thread Liaw, Andy
Yes, that's part of the intention anyway. One can also use them to do clustering. Best, Andy -Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Massimo Bressan Sent: Monday, December 02, 2013 6:34 AM To: r-help@r-project.org Subject

Re: [R] How do I extract Random Forest Terms and Probabilities?

2013-12-02 Thread Liaw, Andy
#2 can be done simply with predict(fmi, type="prob"). See the help page for predict.randomForest(). Best, Andy -Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of arun Sent: Tuesday, November 26, 2013 6:57 PM To: R help Subject: Re:

Re: [R] Split type in the RandomForest package

2013-11-20 Thread Liaw, Andy
Classification trees use the Gini index, whereas the regression trees use sum of squared errors. They are "hard-wired" into the C/Fortran code, so not easily changeable. Best, Andy -Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of

Re: [R] What is the difference between Mean Decrease Accuracy produced by importance(foo) vs foo$importance in a Random Forest Model?

2013-11-19 Thread Liaw, Andy
The difference is importance(..., scale=TRUE). See the help page for detail. If you extract the $importance component from a randomForest object, you do not get the scaling. Best, Andy -Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Beha

Re: [R] FW: Nadaraya-Watson kernel

2013-11-07 Thread Liaw, Andy
Use KernSmooth (one of the recommended packages that are included in R distribution). E.g., > library(KernSmooth) KernSmooth 2.23 loaded Copyright M. P. Wand 1997-2009 > x <- seq(0, 1, length=201) > y <- 4 * cos(2*pi*x) + rnorm(x) > f <- locpoly(x, y, degree=0, kernel="epan", bandwidth=.1) > plo

Re: [R] Creating 3d partial dependence plots

2013-03-20 Thread Liaw, Andy
It needs to be done "by hand", in that partialPlot() does not handle more than one variable at a time. You need to modify its code to do that (and be ready to wait even longer, as it can be slow). Andy -Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-proj

Re: [R] Different results from random.Forest with test option and using predict function

2012-12-04 Thread Liaw, Andy
Without data to reproduce what you saw, we can only guess. One possibility is due to tie-breaking. There are several places where ties can occur and are broken at random, including at the prediction step. One difference between the two ways of doing prediction is that when it's all done withi

Re: [R] How do I make R randomForest model size smaller?

2012-12-04 Thread Liaw, Andy
Try the following: set.seed(100) rf1 <- randomForest(Species ~ ., data=iris) set.seed(100) rf2 <- randomForest(iris[1:4], iris$Species) object.size(rf1) object.size(rf2) str(rf1) str(rf2) You can try it on your own data. That should give you some hints about why the formula interface should be

Re: [R] Partial dependence plot in randomForest package (all flat responses)

2012-11-26 Thread Liaw, Andy
Not unless we have more information. Please read the Posting Guide to see how to make it easier for people to answer your question. Best, Andy -Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Oritteropus Sent: Thursday, November 2

Re: [R] Random Forest for multiple categorical variables

2012-10-17 Thread Liaw, Andy
How about taking the combination of the two? E.g., gamma = factor(paste(alpha, beta1, sep=":")) and use gamma as the response. Best, Andy -Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Gyanendra Pokharel Sent: Tuesday, October
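A minimal sketch of the suggestion above, using made-up toy values for alpha and beta1 (the variable names follow the post; the data are hypothetical). Each distinct pair of levels becomes one level of the combined response:

```r
# Hypothetical toy responses; only the names alpha/beta1/gamma come from the post.
alpha <- factor(c("a", "b", "a", "b"))
beta1 <- factor(c("x", "x", "y", "y"))

# One combined factor whose levels are the observed (alpha, beta1) pairs.
gamma <- factor(paste(alpha, beta1, sep = ":"))

levels(gamma)
table(gamma)
```

Pairs that never occur in the data get no level, so the number of classes can be smaller than the product of the two level counts.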

Re: [R] Random Forest - Extract

2012-10-03 Thread Liaw, Andy
1. Not sure what you want. What "details" are you looking for exactly? If you call predict(trainset) without the newdata argument, you will get the (out-of-bag) prediction of the training set, which is exactly the "predicted" component of the RF object. 2. If you set type="votes" and norm.v

Re: [R] interpret the importance output?

2012-08-29 Thread Liaw, Andy
The "type=1" importance measure in RF compares the prediction error of each tree on the OOB data with the prediction error of the same tree on the OOB data with the values of one variable randomly shuffled. If the variable has no predictive power, then the two should be very close, and there's
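The shuffling idea can be illustrated without any forest at all. The sketch below is an assumption-laden stand-in: it uses lm() and a held-out set instead of per-tree OOB data, and all names (d, perm_imp, mse) are made up for illustration. It is not what randomForest does internally, only the same principle.

```r
set.seed(42)
n <- 200
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- 2 * d$x1 + rnorm(n, sd = 0.5)      # x2 is pure noise by construction

train <- d[1:100, ]
test  <- d[101:200, ]
fit   <- lm(y ~ x1 + x2, data = train)    # stand-in model, not a random forest

mse <- function(model, data) mean((data$y - predict(model, data))^2)
base_mse <- mse(fit, test)

# Increase in held-out MSE after shuffling one predictor at a time.
perm_imp <- sapply(c("x1", "x2"), function(v) {
  shuffled <- test
  shuffled[[v]] <- sample(shuffled[[v]])
  mse(fit, shuffled) - base_mse
})
perm_imp
```

Shuffling the informative predictor should raise the error a lot, while shuffling the noise predictor should leave it nearly unchanged.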

Re: [R] Stratified Sampling with randomForest Regression

2012-06-01 Thread Liaw, Andy
Yes, you need to modify both the R and the underlying C code. It's in the source package on CRAN (the .tar.gz file). Andy -Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Josh Browning Sent: Friday, June 01, 2012 10:48 AM To: r

Re: [R] Random Forest Classification_ForestCombination

2012-05-29 Thread Liaw, Andy
As long as you can remember that the summaries such as variable importance, OOB predictions, and OOB error rates are not applicable, I think that should be fine. Andy -Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Nikita Desai

Re: [R] Question about random Forest function in R

2012-05-29 Thread Liaw, Andy
Hi Kelly, The function has a limitation that it cannot handle any column in your "x" that is a categorical variable with more than 32 categories. One possibility is to see if you can "bin" some of the categories into one to get below 32 categories. Andy -Original Message- From: r-hel
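One way to do the binning, sketched in base R under the assumption that lumping the rarest categories together is acceptable for the problem. The factor f and the label "Other" are hypothetical; the limit of 32 is the one quoted in the post.

```r
set.seed(7)
# Hypothetical factor with 40 levels, 10 common and 30 rare.
f <- factor(sample(sprintf("lvl%02d", 1:40), 2000, replace = TRUE,
                   prob = c(rep(5, 10), rep(1, 30))))

# Keep the 31 most frequent levels and lump everything else into "Other".
counts <- sort(table(f), decreasing = TRUE)
keep   <- names(counts)[1:31]

binned <- as.character(f)
binned[!(binned %in% keep)] <- "Other"
binned <- factor(binned)

nlevels(f)
nlevels(binned)
```

Whether frequency is the right criterion for merging depends on the data; merging by subject-matter similarity may be more sensible.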

Re: [R] Random forests prediction

2012-05-14 Thread Liaw, Andy
That's not how RF works at all. The setting of mtry is irrelevant to this. Andy -Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of matt Sent: Monday, May 14, 2012 10:22 AM To: r-help@r-project.org Subject: Re: [R] Random forests pre

Re: [R] No Data in randomForest predict

2012-05-14 Thread Liaw, Andy
It doesn't: You just get an error if there are NAs in the data; e.g., R> rf1 = randomForest(iris[1:4], iris[[5]]) R> predict(rf1, newdata=data.frame(Sepal.Length=1, Sepal.Width=2, Petal.Length=3, Petal.Width=NA)) Error in predict.randomForest(rf1, newdata = data.frame(Sepal.Length = 1, : mis

Re: [R] Random forests prediction

2012-05-14 Thread Liaw, Andy
I don't think this is so hard to explain. If you evaluate AUC using either OOB prediction or on a test set (or something like CV or bootstrap), that would be what I expect for most data. When you add more variables (that are, say, less informative) to a model, the model has to look harder to f

Re: [R] Partial Dependence and RandomForest

2012-04-17 Thread Liaw, Andy
Note that the partialPlot() function also returns the x-y pairs being plotted, so you can work from there if you wish. As to SD, my guess is you want some sort of confidence interval or band around the curve? I do not know of any theory to produce that, but that may well just be my ignorance.

Re: [R] Execution speed in randomForest

2012-04-13 Thread Liaw, Andy
Without seeing your code, it's hard to say much more, but do avoid using formula when you have large data. Andy -Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Jason & Caroline Shaw Sent: Friday, April 06, 2012 1:20 PM To: jim ho

Re: [R] Partial Dependence and RandomForest

2012-04-13 Thread Liaw, Andy
Please read the help page for the partialPlot() function and make sure you learn about all its arguments (in particular, "which.class"). Andy -Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of jmc Sent: Wednesday, April 11, 2012 2:4

Re: [R] loess function take

2012-04-13 Thread Liaw, Andy
Alternatively, use only a subset to run loess(), either a random sample or something like every other k-th (sorted) data value, or the quantiles. It's hard for me to imagine that that many data points are going to improve your model much at all (unless you use tiny span). Andy From: r-help-b
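A rough sketch of the subsetting idea on simulated data (the data, the step of 20, and span = 0.3 are all arbitrary choices for illustration): fit loess() on every 20th sorted value and predict wherever needed.

```r
set.seed(3)
n <- 20000
x <- sort(runif(n))
y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)

# Every 20th value of the sorted data: 1000 points instead of 20000.
idx <- seq(1, n, by = 20)
sub <- data.frame(x = x[idx], y = y[idx])

fit  <- loess(y ~ x, data = sub, span = 0.3)
pred <- predict(fit, newdata = data.frame(x = c(0.25, 0.75)))
pred
```

The predictions should land close to the underlying curve (about 1 and -1 here) even though only 5% of the points were used.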

Re: [R] Imputing missing values using "LSmeans" (i.e., population marginal means) - advice in R?

2012-04-05 Thread Liaw, Andy
Don't know how you searched, but perhaps this might help: https://stat.ethz.ch/pipermail/r-help/2007-March/128064.html > -Original Message- > From: r-help-boun...@r-project.org > [mailto:r-help-boun...@r-project.org] On Behalf Of Jenn Barrett > Sent: Tuesday, April 03, 2012 1:23 AM > To

Re: [R] Question about randomForest

2012-04-04 Thread Liaw, Andy
> From: r-help-boun...@r-project.org > [mailto:r-help-boun...@r-project.org] On Behalf Of Saruman > > I dont see how this answered the original question of the poster. > > He was quite clear: the value of the predictions coming out > of RF do not > match what comes out of the predict function u

Re: [R] Memory limits for MDSplot in randomForest package

2012-03-30 Thread Liaw, Andy
Sam, As you've probably seen, all the MDSplot() function does is feed 1 - proximity to the cmdscale() function. Some suggestions and clarifications: 1. If all you want is the proximity matrix, you can run randomForest() with keep.forest=FALSE to save memory. You will likely want to run somewhat

Re: [R] fitted values with locfit

2012-03-28 Thread Liaw, Andy
I believe you are expecting the software to do what it did not claim being able to do. predict.locfit() does not have a "type" argument, nor can that take on "terms". When you specify two variables in the smooth, a bivariate smooth is done, so you get one bivariate smooth function, not the sum

[R] job opening at Merck Research Labs, NJ USA

2012-03-20 Thread Liaw, Andy
The Biometrics Research department at the Merck Research Laboratories has an open position to be located in Rahway, New Jersey, USA: This position will be responsible for imaging and bio-signal biomarkers projects including analysis of preclinical, early clinical, and experimental medicine imag

Re: [R] Using caegorical variables in package randomForest.

2012-03-13 Thread Liaw, Andy
The way to represent categorical variables is with factors. See ?factor. randomForest() will handle factors appropriately, as do most modeling functions in R. Andy > -Original Message- > From: r-help-boun...@r-project.org > [mailto:r-help-boun...@r-project.org] On Behalf Of abhishek >

Re: [R] Help on reshape function

2012-03-06 Thread Liaw, Andy
Just using the reshape() function in base R: df.long = reshape(df, varying=list(names(df)[4:7]), direction="long") This also gives two extra columns ("time" and "id") that can be dropped. Andy > -Original Message- > From: r-help-boun...@r-project.org > [mailto:r-help-boun...@r-project.
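For a self-contained version of the call above, here is a sketch on a made-up data frame (the column names subject, group, age, t1..t4 are hypothetical; the original poster's df is not shown in the thread). Columns 4 to 7 hold the repeated measurements.

```r
# Hypothetical wide data: three descriptive columns, then four measurements.
df <- data.frame(subject = 1:3,
                 group   = c("a", "b", "a"),
                 age     = c(30, 41, 25),
                 t1 = c(1.1, 2.1, 3.1),
                 t2 = c(1.2, 2.2, 3.2),
                 t3 = c(1.3, 2.3, 3.3),
                 t4 = c(1.4, 2.4, 3.4))

df.long <- reshape(df, varying = list(names(df)[4:7]), direction = "long")

# Drop the two bookkeeping columns that reshape() adds.
df.trim <- df.long[, !(names(df.long) %in% c("time", "id"))]

names(df.long)
dim(df.trim)
```

With no v.names given, the stacked column takes the name of the first varying column (t1 here), so renaming it afterwards may be worthwhile.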

Re: [R] Good and modern Kernel Regression package in R with auto-bandwidth?

2012-02-29 Thread Liaw, Andy
That's why I said you need the book. The details are all in the book. From: Michael [mailto:comtech@gmail.com] Sent: Thursday, February 23, 2012 1:49 PM To: Liaw, Andy Cc: r-help Subject: Re: [R] Good and modern Kernel Regression package in R with

Re: [R] Good and modern Kernel Regression package in R with auto-bandwidth?

2012-02-23 Thread Liaw, Andy
__ From: Michael [mailto:comtech@gmail.com] Sent: Thursday, February 23, 2012 10:06 AM To: Liaw, Andy Cc: Bert Gunter; r-help Subject: Re: [R] Good and modern Kernel Regression package in R with auto-bandwidth? Thank you Andy! I went thru KernSmooth package but I don't se

Re: [R] Good and modern Kernel Regression package in R with auto-bandwidth?

2012-02-23 Thread Liaw, Andy
ok to get most mileage out of it though. Andy From: Michael [mailto:comtech@gmail.com] Sent: Thursday, February 23, 2012 12:25 AM To: Liaw, Andy Cc: Bert Gunter; r-help Subject: Re: [R] Good and modern Kernel Regression package in R with auto-bandwidth?

Re: [R] Good and modern Kernel Regression package in R with auto-bandwidth?

2012-02-22 Thread Liaw, Andy
Bert's question aside (I was going to ask about laundry, but that's much harder than taxes...), my understanding of the situation is that "optimal" is in the eye of the beholder. There were at least two schools of thought on which is the better way of automatically selecting bandwidth, using pl

Re: [R] indexing by empty string (was RE: Error in predict.randomForest ... subscript out of bounds with NULL name in X)

2012-02-01 Thread Liaw, Andy
name in X) > > Hi Andy, > > On Tuesday, January 31, 2012 08:44:13 AM Liaw, Andy wrote: > > I'm not exactly sure if this is a problem with indexing by > name; i.e., is > > the following behavior by design? The problem is that > names or dimnames > > that ar

Re: [R] randomForest: proximity for new objects using an existing rf

2012-02-01 Thread Liaw, Andy
There's an alternative, but it may not be any more efficient in time or memory... You can run predict() on the training set once, setting nodes=TRUE. That will give you a n by ntree matrix of which node of which tree the data point falls in. For any new data, you would run predict() with node
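Once the two node matrices are in hand, the proximity between a new case and a training case is just the fraction of trees in which both land in the same terminal node. The sketch below assumes the matrices have already been extracted; train_nodes and new_nodes are tiny hypothetical examples, not output from a real forest.

```r
# Hypothetical terminal-node IDs: rows are cases, columns are trees.
train_nodes <- matrix(c(1, 3, 5,
                        1, 3, 6,
                        2, 4, 6), nrow = 3, byrow = TRUE)
new_nodes   <- matrix(c(1, 3, 6,
                        2, 4, 5), nrow = 2, byrow = TRUE)

# prox[i, j] = share of trees where new case i and training case j
# fall in the same terminal node.
prox <- matrix(0, nrow(new_nodes), nrow(train_nodes))
for (i in seq_len(nrow(new_nodes))) {
  for (j in seq_len(nrow(train_nodes))) {
    prox[i, j] <- mean(new_nodes[i, ] == train_nodes[j, ])
  }
}
prox
```

The double loop is the simplest form; for large data it is the part that costs time and memory, as the post warns.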

Re: [R] Random Forest Package

2012-02-01 Thread Liaw, Andy
You should be able to use the Rgui menu to install packages. Andy > -Original Message- > From: r-help-boun...@r-project.org > [mailto:r-help-boun...@r-project.org] On Behalf Of Niratha > Sent: Wednesday, February 01, 2012 5:16 AM > To: r-help@r-project.org > Subject: [R] Random Forest P

[R] indexing by empty string (was RE: Error in predict.randomForest ... subscript out of bounds with NULL name in X)

2012-01-31 Thread Liaw, Andy
I'm not exactly sure if this is a problem with indexing by name; i.e., is the following behavior by design? The problem is that names or dimnames that are empty seem to be treated differently, and one can't index by them: R> junk = 1:3 R> names(junk) = c("a", "b", "") R> junk a b 1 2 3 R> j
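A short base R illustration of the behavior in question, extending the junk example from the post. The object names by_name, by_pos and by_which are made up here. An empty string used as a character index does not match an empty name, but positional or logical indexing still reaches the element.

```r
junk <- 1:3
names(junk) <- c("a", "b", "")

by_name  <- junk[""]                   # NA: "" never matches a name
by_pos   <- junk[3]                    # 3: positional indexing works
by_which <- junk[names(junk) == ""]    # 3: logical indexing works too

by_name
by_pos
by_which
```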

Re: [R] Bivariate Partial Dependence Plots in Random Forests

2012-01-31 Thread Liaw, Andy
The reason that it's not implemented is because of computational cost. Some users had done it on their own using the same idea. It's just that it takes too much memory for even moderately sized data. It can be done much more efficiently in MART because computational shortcuts were used. Be
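To make the cost concrete, here is a by-hand sketch of a two-variable partial dependence surface. It uses lm() purely as a stand-in for any model with a predict() method (an assumption for the sake of a dependency-free example); the data and grid sizes are made up. Every grid cell requires predicting on a full copy of the data.

```r
set.seed(11)
n <- 150
d <- data.frame(x1 = runif(n), x2 = runif(n), x3 = runif(n))
d$y <- d$x1 + 2 * d$x2 + rnorm(n, sd = 0.1)

fit <- lm(y ~ x1 + x2 + x3, data = d)   # stand-in model, not a random forest

g1 <- seq(0, 1, length.out = 5)
g2 <- seq(0, 1, length.out = 5)

# For each (a, b) on the grid: overwrite x1 and x2 in every row,
# predict, and average over the remaining variables.
pd <- outer(g1, g2, Vectorize(function(a, b) {
  tmp <- d
  tmp$x1 <- a
  tmp$x2 <- b
  mean(predict(fit, newdata = tmp))
}))
dim(pd)
```

A 5 x 5 grid already means 25 passes over the data; a finer grid on a large data set with a forest multiplies that quickly.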

Re: [R] Variable selection based on both training and testing data

2012-01-30 Thread Liaw, Andy
Variable selection is part of the training process-- it chooses the model. By definition, test data is used only for testing (evaluating chosen model). If you find a package or function that does variable selection on test data, run from it! Best, Andy > -Original Message- > From: r-he

Re: [R] What is the function for "smoothing splines with the smoothing parameter selected by generalized maximum likelihood?

2012-01-09 Thread Liaw, Andy
See the gss package on CRAN. Andy > -Original Message- > From: r-help-boun...@r-project.org > [mailto:r-help-boun...@r-project.org] On Behalf Of ali_protocol > Sent: Monday, January 09, 2012 7:13 AM > To: r-help@r-project.org > Subject: [R] What is the function for "smoothing splines wi

Re: [R] explanation why RandomForest don't require a transformations (e.g. logarithmic) of variables

2011-12-05 Thread Liaw, Andy
You should see no differences beyond what you'd get by running RF a second time with a different random number seed. Best, Andy From: gianni lavaredo [mailto:gianni.lavar...@gmail.com] Sent: Monday, December 05, 2011 2:19 PM To: Liaw, Andy Cc: r-h

Re: [R] explanation why RandomForest don't require a transformations (e.g. logarithmic) of variables

2011-12-05 Thread Liaw, Andy
Tree based models (such as RF) are invariant to monotonic transformations in the predictor (x) variables, because they only use the ranks of the variables, not their actual values. More specifically, they look for splits that are at the mid-points of unique values. Thus the resulting trees are
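The rank argument can be checked with a toy single-split search written in base R. best_split() below is a hypothetical helper, not code from randomForest or rpart: it tries every split between adjacent sorted values and keeps the one with the smallest sum of squared errors.

```r
# Hypothetical helper: best single split of y on x by sum of squared errors.
best_split <- function(x, y) {
  o  <- order(x)
  xs <- x[o]
  ys <- y[o]
  n  <- length(ys)
  sse <- sapply(1:(n - 1), function(i) {
    left  <- ys[1:i]
    right <- ys[(i + 1):n]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  i <- which.min(sse)
  list(cut = (xs[i] + xs[i + 1]) / 2,   # mid-point between adjacent values
       left = sort(o[1:i]))             # row indices sent to the left child
}

set.seed(5)
x <- runif(50, 1, 100)
y <- ifelse(x > 40, 5, 1) + rnorm(50, sd = 0.2)

s_raw <- best_split(x, y)
s_log <- best_split(log(x), y)

identical(s_raw$left, s_log$left)
```

The cut value itself differs between the two scales, but the same observations go left and right, so the fitted partition is the same.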

Re: [R] Random Forests in R

2011-12-01 Thread Liaw, Andy
The first version of the package was created by re-writing the main program in the original Fortran as C, and calls other Fortran subroutines that were mostly untouched, so dynamic memory allocation can be done. Later versions have most of the Fortran code translated/re-written in C. Currently

Re: [R] Question about randomForest

2011-11-28 Thread Liaw, Andy
Not only that, but in the same help page, same "Value" section, it says: predicted the predicted values of the input data based on out-of-bag samples so people really should read the help pages instead of speculate... If the error rates were not based on OOB samples, they would drop to (

Re: [R] tuning random forest. An unexpected result

2011-11-23 Thread Liaw, Andy
Gianni, You should not "tune" ntree in cross-validation or other validation methods, and especially should not be using OOB MSE to do so. 1. At ntree=1, you are using only about 36% of the data to assess the performance of a single random tree. This number can vary wildly. I'd say don't both
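The "about 36%" figure in point 1 is the expected share of cases left out of a bootstrap sample, which approaches exp(-1) for large n. A quick simulation in base R (n and the number of replicates are arbitrary choices):

```r
set.seed(1)
n <- 1000

# Fraction of cases not drawn in a bootstrap sample of size n, repeated 200 times.
oob_frac <- replicate(200, {
  inbag <- sample(n, n, replace = TRUE)
  mean(!(seq_len(n) %in% inbag))
})

mean(oob_frac)
range(oob_frac)
exp(-1)
```

The spread across replicates also hints at why an error estimate from a single tree's OOB cases is noisy.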

Re: [R] gsDesign

2011-11-15 Thread Liaw, Andy
Hi Dongli, Questions about usage of specific contributed packages are best directed toward the package maintainer/author first, as they are likely the best sources of information, and they don't necessarily subscribe to or keep up with the daily deluge of R-help messages. (In this particular c

Re: [R] randomForest - NaN in %IncMSE

2011-09-23 Thread Liaw, Andy
You are not giving anyone much to go on. Please read the posting guide and see how to ask your question in a way that's easier for others to answer. At the _very_ least, show what commands you used, what your data looks like, etc. Andy > -Original Message- > From: r-help-boun...@r-pr

Re: [R] class weights with Random Forest

2011-09-13 Thread Liaw, Andy
The current "classwt" option in the randomForest package has been there since the beginning, and is different from how the official Fortran code (version 4 and later) implements class weights. It simply accounts for the class weights in the Gini index calculation when splitting nodes, exactly as

Re: [R] randomForest memory footprint

2011-09-08 Thread Liaw, Andy
It looks like you are building a regression model. With such a large number of rows, you should try to limit the size of the trees by setting nodesize to something larger than the default (5). The issue, I suspect, is the fact that the size of the largest possible tree has about 2*nodesize nod

Re: [R] convert a splus randomforest object to R

2011-08-09 Thread Liaw, Andy
You really need to follow the suggestions in the posting guide to get the best help from this list. Which versions of randomForest are you using in S-PLUS and R? Which version of R are you using? When you restore the object into R, what does str(object) say? Have you also tried dump()/sour

Re: [R] randomForest partial dependence plot variable names

2011-08-09 Thread Liaw, Andy
See if the following is close to what you're looking for. If not, please give more detail on what you want to do. data(airquality) airquality <- na.omit(airquality) set.seed(131) ozone.rf <- randomForest(Ozone ~ ., airquality, importance=TRUE) imp <- importance(ozone.rf) # get the importance me

Re: [R] squared "pie chart" - is there such a thing?

2011-07-25 Thread Liaw, Andy
Has anyone suggested mosaic displays? That's the closest I can think of as a "square pie chart"... > -Original Message- > From: r-help-boun...@r-project.org > [mailto:r-help-boun...@r-project.org] On Behalf Of Naomi Robbins > Sent: Sunday, July 24, 2011 7:09 AM > To: Thomas Levine > Cc

Re: [R] *not* using attach() *but* in one case ....

2011-05-19 Thread Liaw, Andy
From: Prof Brian Ripley > > Hmm, load() does have an 'envir' argument. So you could simply use > that and with() (which is pretty much what attach() does internally). > > If people really wanted a lazy approach, with() could be extended to > allow file names (as attach does). I'm not sure if

Re: [R] Rotation Forest in R

2011-04-12 Thread Liaw, Andy
I don't have access to that article, but just reading the abstract, it should be quite easy to do by writing a wrapper function that calls randomForest(). I've done so with random projections before. One limitation to methods like these is that they only apply to all numeric data. Andy > -

Re: [R] Difference in mixture normals and one density

2011-04-04 Thread Liaw, Andy
Is something like this what you're looking for? R> library(nor1mix) R> nmix2 <- norMix(c(2, 3), sig2=c(25, 4), w=c(.2, .8)) R> dnorMix(1, nmix2) - dnorm(1, 2, 5) [1] 0.03422146 Andy > -Original Message- > From: r-help-boun...@r-project.org > [mailto:r-help-boun...@r-project.org] On Beha
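The same number can be cross-checked in base R without nor1mix, since a normal mixture density is just a weighted sum of dnorm() values. The names w, mu, sigma and mix_density are made up for this sketch; the parameters are the ones in the post (variances 25 and 4, so standard deviations 5 and 2).

```r
w     <- c(0.2, 0.8)
mu    <- c(2, 3)
sigma <- sqrt(c(25, 4))

# Mixture density at a single point x.
mix_density <- function(x) sum(w * dnorm(x, mean = mu, sd = sigma))

diff_at_1 <- mix_density(1) - dnorm(1, 2, 5)
diff_at_1
```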

Re: [R] ok to use glht() when interaction is NOT significant?

2011-03-08 Thread Liaw, Andy
Just to add my ever depreciating $0.02 USD: Keep in mind that the significance testing paradigm puts a constraint on false positive rate, and let false negative rate float. What you should consider is whether that makes sense in your situation. All too often this is not carefully considered, and

Re: [R] Coefficient of Determination for nonlinear function

2011-03-04 Thread Liaw, Andy
As far as I can tell, Uwe is not even fitting a model, but instead just solving a nonlinear equation, so I don't know why he wants a R^2. I don't see a statistical model here, so I don't know why one would want a statistical measure. Andy > -Original Message- > From: r-help-boun...@r-pr

Re: [R] lm - log(variable) - skip log(0)

2011-02-25 Thread Liaw, Andy
You need to use "==" instead of "=" for testing equality. While you're at it, you should check for positive values, not just screening out 0s. This works for me: R> mydata = data.frame(x=0:10, y=runif(11)) R> fm = lm(y ~ log(x), mydata, subset=x>0) Andy > -Original Message- > From:

Re: [R] Random Forest & Cross Validation

2011-02-24 Thread Liaw, Andy
Exactly as Max said. See the rfcv() function in the latest version of randomForest, as well as the reference in the help page for that function. OOB estimate is as accurate as CV estimate _if_ you run straight RF. Most other methods do not have this "feature". However, if you start adding ste

Re: [R] tri-cube and gaussian weights in loess

2011-02-07 Thread Liaw, Andy
locfit() in the locfit package has a slightly more modern implementation of loess, and is much more flexible in that it has a lot of options to tweak. One such option is the kernel. There are seven to choose from. Andy From: wisdomtooth > > From what I understand, loess in R uses the stand

Re: [R] How to measure/rank "variable importance" when using rpart?

2011-01-24 Thread Liaw, Andy
Check out caret::varImp.rpart(). It's described in the original CART book. Andy From: Tal Galili > > Hello all, > > When building a CART model (specifically classification tree) > using rpart, > it is sometimes interesting to know what is the importance of > the various > variables introduc

Re: [R] randomForest: too many elements specified?

2011-01-21 Thread Liaw, Andy
LP > 35 Gatehouse Drive > Waltham, MA 02451 > USA > 781-839-4304 > ryszard.czermin...@astrazeneca.com > > RE: [R] randomForest: too many element specified? > Liaw, Andy > Mon, 17 Jan 2005 05:56:28 -0800 > > From: luk > > > > When I run randonForest wi

Re: [R] Where is a package NEWS.Rd located?

2011-01-06 Thread Liaw, Andy
I was communicating with Kevin off-list. The problem seems to be run time, not install time. news() calls tools:::.build_news_db(), and the 2nd line of that function is: nfile <- file.path(dir, "inst", "NEWS.Rd") and that's the problem: an installed package shouldn't have an inst/ subdirector

Re: [R] randomForest speed improvements

2011-01-05 Thread Liaw, Andy
From: Liaw, Andy > > Note that that isn't exactly what I recommended. If you look at the > example in the help page for combine(), you'll see that it is > combining > RF objects trained on the same data; i.e., instead of having > one RF with > 500 trees, you can

Re: [R] randomForest speed improvements

2011-01-05 Thread Liaw, Andy
Note that that isn't exactly what I recommended. If you look at the example in the help page for combine(), you'll see that it is combining RF objects trained on the same data; i.e., instead of having one RF with 500 trees, you can combine five RFs trained on the same data with 100 trees each into

Re: [R] randomForest speed improvements

2011-01-04 Thread Liaw, Andy
If you have multiple cores, one "poor man's solution" is to run separate forests in different R sessions, save the RF objects, load them into the same session and combine() them. You can do this less clumsily if you use things like Rmpi or other distributed computing packages. Another considerati

Re: [R] randomForest: help with combine() function

2010-12-11 Thread Liaw, Andy
combine() is meant to be used on randomForest objects that were built from identical training data. Andy > -Original Message- > From: r-help-boun...@r-project.org > [mailto:r-help-boun...@r-project.org] On Behalf Of Dennis Duro > Sent: Friday, December 10, 2010 11:59 PM > To: r-help@r-pr

Re: [R] randomForest: How to append ID column along with predictions

2010-12-07 Thread Liaw, Andy
The order in the output correspond to the order of the input. I will patch the code so that it grabs the row names of the input (if exist). If you specify type="prob", it already labels the rows by the input row names. > -Original Message- > From: r-help-boun...@r-project.org > [mailto:

Re: [R] randomForest parameters for image classification

2010-11-18 Thread Liaw, Andy
data you want to predict, not the other way around. Andy > -Original Message- > From: Deschamps, Benjamin [mailto:benjamin.descha...@agr.gc.ca] > Sent: Tuesday, November 16, 2010 11:16 AM > To: r-help@r-project.org > Cc: Liaw, Andy > Subject: RE: [R] randomForest pa

Re: [R] randomForest parameters for image classification

2010-11-11 Thread Liaw, Andy
Please show us the code you used to run randomForest, the output, as well as what you get with other algorithms (on the same random subset for comparison). I have yet to see a dataset where randomForest does _far_ worse than other methods. Andy > -Original Message- > From: r-help-boun..

[R] Contract programming position at Merck (NJ, USA)

2010-10-29 Thread Liaw, Andy
Job: Scientific programmer at Merck, Biostatistics, Rahway, NJ, USA [Job Description] This position works closely with statisticians to process and analyze ultrasound, MRI, and radiotelemetry longitudinal studies using a series of programs developed in R and Mathworks/Matlab. This position provid

Re: [R] to determine the variable importance in svm

2010-10-26 Thread Liaw, Andy
The caret package has answers to all your questions. > -Original Message- > From: r-help-boun...@r-project.org > [mailto:r-help-boun...@r-project.org] On Behalf Of Neeti > Sent: Tuesday, October 26, 2010 10:42 AM > To: r-help@r-project.org > Subject: [R] to determine the variable importa

Re: [R] Random Forest AUC

2010-10-24 Thread Liaw, Andy
ing). > >> > >> For example, k nearest neighbors are not known to over > fit, but a 1nn > >> model will always perfectly predict the training data. > >> > >> Max > >> > >> On Oct 23, 2010, at 9:05 AM, "Liaw, > Andy" wro

Re: [R] Random Forest AUC

2010-10-23 Thread Liaw, Andy
What Breiman meant is that as the model gets more complex (i.e., as the number of trees tends to infinity) the generalization error (test set error) does not increase. This does not hold for boosting, for example; i.e., you can't "boost forever", which necessitates finding the optimal numb

Re: [R] Random Forest AUC

2010-10-22 Thread Liaw, Andy
Let me expand on what Max showed. For the most part, performance on training set is meaningless. (That's the case for most algorithms, but especially so for RF.) In the default (and recommended) setting, the trees are grown to the maximum size, which means that quite likely there's only one data

Re: [R] RandomForest Proximity Matrix

2010-10-21 Thread Liaw, Andy
From: Michael Lindgren > > Greetings R Users! > > I am posting to inquire about the proximity matrix in the randomForest > R-package. I am having difficulty pushing very large data through the > algorithm and it appears to hang on the building of the prox > matrix. I have > read on Dr. Breiman

Re: [R] Force evaluation of variable when calling partialPlot

2010-10-04 Thread Liaw, Andy
The plot titles aren't pretty, but the following works for me: R> library(randomForest) randomForest 4.5-37 Type rfNews() to see new features/changes/bug fixes. R> set.seed(1004) R> iris.rf <- randomForest(iris[-5], iris[[5]], ntree=1001) R> par(mfrow=c(2,2)) R> for (i in 1:4) partialPlot(iris.rf,

Re: [R] randomForest - PartialPlot - reg

2010-09-24 Thread Liaw, Andy
In a partial dependence plot, only the relative scale, not absolute scale, of the y-axis is meaningful. I.e., you can compare the range of the curves between partial dependence plots of two different variables, but not the actual numbers on the axis. The range is compressed compared to the origin

Re: [R] Passing a function as a parameter...

2010-09-22 Thread Liaw, Andy
One possibility: R> f = function(x, f) eval(as.call(list(as.name(f), x))) R> f(1:10, "mean") [1] 5.5 R> f(1:10, "max") [1] 10 Andy From: Jonathan Greenberg > R-helpers: > > If I want to pass a character name of a function TO a > function, and then > have that function executed, how would I do
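Another possibility, offered as an alternative sketch rather than a correction: match.fun() looks up a function from its name (or accepts a function object directly). The wrapper name apply_by_name is hypothetical.

```r
# Look up the function by name, then call it on x.
apply_by_name <- function(x, fname) {
  fun <- match.fun(fname)
  fun(x)
}

res_mean <- apply_by_name(1:10, "mean")
res_max  <- apply_by_name(1:10, "max")
res_mean
res_max
```

Because match.fun() also accepts a function, apply_by_name(1:10, mean) works with the same wrapper.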

Re: [R] randomForest - partialPlot - Reg

2010-09-22 Thread Liaw, Andy
> From: Vijayan Padmanabhan > > Dear R Group > I had an observation that in some cases, when I use the > randomForest model > to create partialPlot in R using the package "randomForest" > the y-axis displays values that are more than -1! > It is a classification problem that i was trying to addr

Re: [R] OT: Is randomization for targeted cancer therapies ethical?

2010-09-21 Thread Liaw, Andy
> From: jlu...@ria.buffalo.edu > > Clearly inferior treatments are unethical. The Big Question is: What constitute "clearly"? Who or How to decide what is "clearly"? I'm sure there are plenty of people who don't understand much Statistics and are perfectly willing to say the results on the tw

Re: [R] Decision Tree in Python or C++?

2010-09-08 Thread Liaw, Andy
For Python, check out the project "orange": http://www.ailab.si/orange/doc/catalog/Classify/ClassificationTree.htm Not sure about C++, but OpenDT is in C: http://opendt.sourceforge.net/ Looks like OpenCV has both Python and C++ interfaces (didn't see a Python interface to decision trees, though): htt

Re: [R] RandomForests Limitations? Work Arounds?

2010-09-07 Thread Liaw, Andy
You're not giving us much to go on, so the info I can give is correspondingly vague. I take it you are using RF in "unsupervised" mode. What RF does in this case is simply generate a second part of the data that have the same marginal distribution as the data you have, but the variables are indep
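The idea of a second data set with the same marginal distributions but independent variables can be sketched in base R. This is one way to build such data for illustration (sampling each column on its own); it is not taken from the randomForest source, and the object names are made up.

```r
set.seed(1)
x <- iris[1:4]

# Sample each column independently from its own observed values:
# the marginal distributions are kept, the dependence between columns is broken.
synthetic <- as.data.frame(lapply(x, function(col) sample(col, replace = TRUE)))

# Stack real and synthetic rows; a classifier is then asked to tell them apart.
combined <- rbind(x, synthetic)
label <- factor(rep(c("real", "synthetic"), each = nrow(x)))

# Correlation is strong in the real data and should be weak in the synthetic part.
cor_real <- cor(x$Petal.Length, x$Petal.Width)
cor_synthetic <- cor(synthetic$Petal.Length, synthetic$Petal.Width)
```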

[R] Open position at Merck (NJ, USA)

2010-09-07 Thread Liaw, Andy
Job description: Computational statistician/biometrician The Biometrics Research Department at Merck Research Laboratories, Merck & Co., Inc. in Rahway, NJ, is seeking a highly motivated statistician/data analyst to work in its basic research, drug discovery, preclinical and early clinical develo

Re: [R] predict.loess and NA/NaN values

2010-08-27 Thread Liaw, Andy
From: Philipp Pagel > > In a current project, I am fitting loess models to subsets of data in > order to use the loess predictions for normalization (similar to what > is done in many microarray analyses). While working on this I ran into > a problem when I tried to predict from the loess models a

Re: [R] Learning ANOVA

2010-08-16 Thread Liaw, Andy
From: Stephen Liu > > Hi JesperHybel, > > Thanks for your advice. > > >If you're trying to follow the youtube video you have a > typing mistake here: > > >InsectSprays.aov <-(test01$count ~ test01$spray) > > >I think this should be: > > >InsectSprays.aov <-aov(test01$count ~ test01$spray) >
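The difference made by the missing `aov()` call can be checked directly. This sketch uses the built-in `InsectSprays` data with the `data=` argument rather than the `test01` object from the thread, which is an assumption made to keep it self-contained.

```r
data(InsectSprays)

# Without aov(), the assignment stores only a formula object; nothing is fitted.
not_a_fit <- count ~ spray
class(not_a_fit)

# With aov(), the one-way model is actually fitted.
InsectSprays.aov <- aov(count ~ spray, data = InsectSprays)
summary(InsectSprays.aov)
```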

Re: [R] Learning ANOVA

2010-08-13 Thread Liaw, Andy
From: Stephen Liu > > Hi folks, > > R on Ubuntu 10.04 64 bit. > > Performed following steps on R:- > > ### to access to the object > > data(InsectSprays) > > ### create a .csv file > > write.csv(InsectSprays, "InsectSpraysCopy.csv") > > > On another terminal > $ sudo updatedb > $ locate Inse

Re: [R] Error on random forest variable importance estimates

2010-08-06 Thread Liaw, Andy
From: Pierre Dubath > > Hello, > > I am using the R randomForest package to classify variable > stars. I have > a training set of 1755 stars described by (too) many > variables. Some of > these variables are highly correlated. > > I believe that I understand how randomForest works and how >

Re: [R] Collinearity in Moderated Multiple Regression

2010-08-04 Thread Liaw, Andy
Seems to me it may be worth stating what may be elementary to some on this list: - If all relevant variables are included in the model and the "true model" is indeed linear, then all least squares estimated coefficients are unbiased. [ David Ruppert once said about the three kinds of lies: Lie

Re: [R] Collinearity in Moderated Multiple Regression

2010-08-03 Thread Liaw, Andy
If the collinearity you're seeing arose from the addition of a product (interaction) term, I do not think penalization is the best answer. What is the goal of your analysis? If it's prediction, then I wouldn't worry about this type of collinearity. If you're interested in inference, I'd try some

Re: [R] Problems with normality req. for ANOVA

2010-08-03 Thread Liaw, Andy
As a matter of fact, I would say both Bert and I encounter "designed experiments" a lot more than "observational studies", yet we speak from experience that those things that Bert mentioned happen on a daily basis. When you talk to experimenters, ask your questions carefully and you'll see these t

Re: [R] randomForest outlier return NA

2010-07-15 Thread Liaw, Andy
There's a bug in the code. If you add row names to the X matrix before you call randomForest(), you'd get: R> summary(outlier(mdl.rf)) Min. 1st Qu. Median Mean 3rd Qu. Max. -1.0580 -0.5957 0. 0.6406 1.2650 9.5200 I'll fix this in the next release. Thanks for reporting. Bes

Re: [R] anyone know why package "RandomForest" na.roughfix is so slow??

2010-07-02 Thread Liaw, Andy
I'll incorporate some of these ideas into the next release. Thanks! Best, Andy -Original Message- From: h.wick...@gmail.com [mailto:h.wick...@gmail.com] On Behalf Of Hadley Wickham Sent: Thursday, July 01, 2010 8:08 PM To: Mike Williamson Cc: Liaw, Andy; r-help Subject: Re: [R] a

Re: [R] anyone know why package "RandomForest" na.roughfix is so slow??

2010-07-01 Thread Liaw, Andy
roughfix(x)) user system elapsed 8.44 0.39 8.85 R 2.11.1, randomForest 4.5-35, Windows XP (32-bit), Thinkpad T61 with 2GB ram. Andy From: Mike Williamson [mailto:this.is@gmail.com] Sent: Thursday, July 01, 2010 12:48 PM To: Liaw, Andy Cc: r-h

Re: [R] anyone know why package "RandomForest" na.roughfix is so slow??

2010-07-01 Thread Liaw, Andy
You have not shown any code on exactly how you use na.roughfix(), so I can only guess. If you are doing something like: randomForest(y ~ ., mybigdata, na.action=na.roughfix, ...) I would not be surprised that it's taking very long on large datasets. Most likely it's caused by the formula inter
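For readers unsure what this kind of rough imputation amounts to, here is a base-R sketch that fills numeric columns with the median and factor columns with the most frequent level, which is the rule the `na.roughfix()` help page describes. `rough_impute` is a hypothetical stand-in written for this example, not the package's code.

```r
# Hypothetical stand-in for median/mode imputation on a data frame.
rough_impute <- function(df) {
  for (j in seq_along(df)) {
    col <- df[[j]]
    if (!anyNA(col)) next
    if (is.numeric(col)) {
      # Numeric column: replace NA with the column median.
      col[is.na(col)] <- median(col, na.rm = TRUE)
    } else if (is.factor(col)) {
      # Factor column: replace NA with the most frequent level.
      counts <- table(col)
      col[is.na(col)] <- names(counts)[which.max(counts)]
    }
    df[[j]] <- col
  }
  df
}

d <- data.frame(x = c(1, NA, 3, 10), g = factor(c("a", "b", NA, "b")))
d_filled <- rough_impute(d)
```

Filling the data once in a separate step like this also makes it easy to time the imputation on its own, apart from the model fit.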

Re: [R] Linear Discriminant Analysis in R

2010-05-28 Thread Liaw, Andy
cobler_squad needs more basic help than doing lda. The data input just doesn't make sense. If vowel_feature is a data frame, then G <- vowel_feature[15] creates another data frame containing the 15th variable in vowel_feature, so "G" is the name of a data frame, not a variable in a data frame.
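The distinction can be seen with any data frame. The sketch below uses the built-in `iris` data in place of `vowel_feature`, which is not available here.

```r
G_df  <- iris[5]     # single brackets: a data frame with one column
G_vec <- iris[[5]]   # double brackets: the column itself (a factor here)

class(G_df)
class(G_vec)
```

A function that expects a grouping vector would want `G_vec` (or `iris$Species`), not the one-column data frame `G_df`.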
