Dominik,

There are a number of formulations of this statistic (see the Kvålseth[*] reference below).
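For reference, the two formulations contrasted in this thread can be written out as below. This is only a sketch: the symbols (y_i for observed values, \hat{y}_i for predictions, \bar{y} for the mean of the observed values) and the subscripts "cor" and "trad" are labels chosen here for illustration and do not come from caret or from Kvålseth.

```latex
% Squared-correlation form (what cor(obs, pred)^2 computes)
R^2_{\mathrm{cor}} = \left[ \operatorname{cor}(y, \hat{y}) \right]^2

% "Traditional" form (what the quoted customSummary computes)
R^2_{\mathrm{trad}} = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}
```

The first form is a squared correlation and so cannot be negative; the second becomes negative whenever the residual sum of squares exceeds the total sum of squares.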
I tend to think of R^2 as the proportion of variance explained by the
model[**]. With the "traditional" formula, it is possible to get
negative proportions (if there are extreme outliers in the
predictions, the negative proportion can be very large). I used the
squared-correlation formulation in caret because its value always lies
in [0, 1]. It is called "R^2" after all!

Here is an example:

> set.seed(1)
> simObserved <- rnorm(100)
> simPredicted <- simObserved + rnorm(100)*.1
>
> cor(simObserved, simPredicted)^2
[1] 0.9887525
> customSummary(data.frame(obs = simObserved,
+                          pred = simPredicted))
      RMSE   Rsquared
0.09538273 0.98860908
>
> simPredicted[1]
[1] -0.6884905
> simPredicted[1] <- 10
>
> cor(simObserved, simPredicted)^2
[1] 0.3669257
> customSummary(data.frame(obs = simObserved,
+                          pred = simPredicted))
     RMSE  Rsquared
 1.066900 -0.425169

It is somewhat extreme, but it does happen.

Max

* Kvålseth, T. (1985). Cautionary note about $R^2$. The American
Statistician, 39(4), 279–285.

** This is a very controversial statement when non-linear models are
used. I'd rather use RMSE, but many scientists I work with still think
in terms of R^2 regardless of the model. The randomForest function
also computes this statistic, but calls it "% Var explained" instead
of explicitly labeling it as "R^2". This statistic has generated
heated debates and I hope that I will not have to wear a scarlet R in
Nashville in a few weeks.

On Thu, May 17, 2012 at 1:35 PM, Dominik Bruhn <domi...@dbruhn.de> wrote:
> Hi Max,
> thanks again for the answer.
>
> I checked the caret implementation and you were right. If the
> predictions of the model are constant (i.e. sd(pred)==0), then the
> implementation returns NA for the rSquare (in postResample). This is
> mainly because the caret implementation uses `cor` (from the
> stats-package) which would throw an error for values with sd(pred)==0.
>
> Do you know why this is implemented in this way? I wrote my own
> summaryFunction which calculates rSquare by hand and it works fine.
> It nevertheless does NOT(!) generate the same values as the original
> implementation. It seems that the calculation of Rsquare is not
> consistent. I took mine from Wikipedia [1].
>
> Here is my code:
> ---
> customSummary <- function (data, lev = NULL, model = NULL) {
>   # Calculate rSquare
>   ssTot <- sum((data$obs-mean(data$obs))^2)
>   ssErr <- sum((data$obs-data$pred)^2)
>   rSquare <- 1-(ssErr/ssTot)
>
>   # Calculate MSE
>   mse <- mean((data$pred - data$obs)^2)
>
>   # Aggregate
>   out <- c(sqrt(mse), rSquare)
>   names(out) <- c("RMSE", "Rsquared")
>
>   return(out)
> }
> ---
>
> [1]: http://en.wikipedia.org/wiki/Coefficient_of_determination#Definitions
>
> Thanks!
> Dominik
>
>
> On 17/05/12 04:10, Max Kuhn wrote:
>> Dominik,
>>
>> See this line:
>>
>>>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
>>>   30.37   30.37   30.37   30.37   30.37   30.37
>>
>> The variance of the predictions is zero. caret uses the formula for
>> R^2 by calculating the correlation between the observed data and the
>> predictions which uses sd(pred) which is zero. I believe that the same
>> would occur with other formulas for R^2.
>>
>> Max
>>
>> On Wed, May 16, 2012 at 11:54 AM, Dominik Bruhn <domi...@dbruhn.de> wrote:
>>> Thanks Max for your answer.
>>>
>>> First, I do not understand your post. Why is it a problem if two of
>>> the predictions match? From the formula for calculating R^2 I can see
>>> that there will be a DivByZero iff the total sum of squares is 0. This
>>> is only true if the predictions of all the predicted points from the
>>> test-set are equal to the mean of the test-set. Why should this happen?
>>>
>>> Anyway, I wrote the following code to check what you were trying to
>>> tell me:
>>>
>>> --
>>> library(caret)
>>> data(trees)
>>> formula=Volume~Girth+Height
>>>
>>> customSummary <- function (data, lev = NULL, model = NULL) {
>>>   print(summary(data$pred))
>>>   return(defaultSummary(data, lev, model))
>>> }
>>>
>>> tc=trainControl(method='cv', summaryFunction=customSummary)
>>> train(formula, data=trees, method='rpart', trControl=tc)
>>> --
>>>
>>> This outputs:
>>> ---
>>>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
>>>   18.45   18.45   18.45   30.12   35.95   53.44
>>>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
>>>   22.69   22.69   22.69   32.94   38.06   53.44
>>>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
>>>   30.37   30.37   30.37   30.37   30.37   30.37
>>> [cut many values like this]
>>> Warning: In nominalTrainWorkflow(dat = trainData, info = trainInfo,
>>> method = method, :
>>>   There were missing values in resampled performance measures.
>>> -----
>>>
>>> As I didn't understand your post, I don't know if this confirms your
>>> assumption.
>>>
>>> Thanks anyway,
>>> Dominik
>>>
>>>
>>> On 16/05/12 17:30, Max Kuhn wrote:
>>>> More information is needed to be sure, but it is most likely that some
>>>> of the resampled rpart models produce the same prediction for the
>>>> hold-out samples (likely the result of no viable split being found).
>>>>
>>>> Almost every incarnation of R^2 requires the variance of the
>>>> prediction. This particular failure mode would result in a divide by
>>>> zero.
>>>>
>>>> Try using your own summary function (see ?trainControl) and put a
>>>> print(summary(data$pred)) in there to verify my claim.
>>>>
>>>> Max
>>>>
>>>> On Wed, May 16, 2012 at 11:30 AM, Max Kuhn <mxk...@gmail.com> wrote:
>>>>> More information is needed to be sure, but it is most likely that some
>>>>> of the resampled rpart models produce the same prediction for the
>>>>> hold-out samples (likely the result of no viable split being found).
>>>>>
>>>>> Almost every incarnation of R^2 requires the variance of the
>>>>> prediction. This particular failure mode would result in a divide by
>>>>> zero.
>>>>>
>>>>> Try using your own summary function (see ?trainControl) and put a
>>>>> print(summary(data$pred)) in there to verify my claim.
>>>>>
>>>>> Max
>>>>>
>>>>> On Tue, May 15, 2012 at 5:55 AM, Dominik Bruhn <domi...@dbruhn.de> wrote:
>>>>>> Hi,
>>>>>> I got the following problem when trying to build an rpart model and using
>>>>>> everything but LOOCV. Originally, I wanted to use k-fold partitioning,
>>>>>> but every partitioning except LOOCV throws the following warning:
>>>>>>
>>>>>> ----
>>>>>> Warning message: In nominalTrainWorkflow(dat = trainData, info =
>>>>>> trainInfo, method = method, : There were missing values in resampled
>>>>>> performance measures.
>>>>>> -----
>>>>>>
>>>>>> Below are some simplified testcases which reproduce the warning on my
>>>>>> system.
>>>>>>
>>>>>> Question: What does this warning mean? How can I avoid it?
>>>>>>
>>>>>> System-Information:
>>>>>> -----
>>>>>> > sessionInfo()
>>>>>> R version 2.15.0 (2012-03-30)
>>>>>> Platform: x86_64-pc-linux-gnu (64-bit)
>>>>>>
>>>>>> locale:
>>>>>>  [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C
>>>>>>  [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8
>>>>>>  [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8
>>>>>>  [7] LC_PAPER=C                 LC_NAME=C
>>>>>>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>>>>> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
>>>>>>
>>>>>> attached base packages:
>>>>>> [1] stats graphics grDevices utils datasets methods base
>>>>>>
>>>>>> other attached packages:
>>>>>> [1] rpart_3.1-52 caret_5.15-023 foreach_1.4.0 cluster_1.14.2 reshape_0.8.4
>>>>>> [6] plyr_1.7.1 lattice_0.20-6
>>>>>>
>>>>>> loaded via a namespace (and not attached):
>>>>>> [1] codetools_0.2-8 compiler_2.15.0 grid_2.15.0 iterators_1.0.6
>>>>>> [5] tools_2.15.0
>>>>>> -------
>>>>>>
>>>>>>
>>>>>> Simplified Testcase I: Throws warning
>>>>>> ---
>>>>>> library(caret)
>>>>>> data(trees)
>>>>>> formula=Volume~Girth+Height
>>>>>> train(formula, data=trees, method='rpart')
>>>>>> ---
>>>>>>
>>>>>> Simplified Testcase II: Every other CV-method also throws the warning,
>>>>>> for example using 'cv':
>>>>>> ---
>>>>>> library(caret)
>>>>>> data(trees)
>>>>>> formula=Volume~Girth+Height
>>>>>> tc=trainControl(method='cv')
>>>>>> train(formula, data=trees, method='rpart', trControl=tc)
>>>>>> ---
>>>>>>
>>>>>> Simplified Testcase III: The only CV-method which is working is 'LOOCV':
>>>>>> ---
>>>>>> library(caret)
>>>>>> data(trees)
>>>>>> formula=Volume~Girth+Height
>>>>>> tc=trainControl(method='LOOCV')
>>>>>> train(formula, data=trees, method='rpart', trControl=tc)
>>>>>> ---
>>>>>>
>>>>>>
>>>>>> Thanks!
>>>>>> --
>>>>>> Dominik Bruhn
>>>>>> mailto: domi...@dbruhn.de
>>>>>>
>>>>>> ______________________________________________
>>>>>> R-help@r-project.org mailing list
>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>>> PLEASE do read the posting guide
>>>>>> http://www.R-project.org/posting-guide.html
>>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>>>
>>>>>
>>>>> --
>>>>>
>>>>> Max
>>>>
>>>
>>> --
>>> Dominik Bruhn
>>> mailto: domi...@dbruhn.de
>>>
>>
>
> --
> Dominik Bruhn
> mailto: domi...@dbruhn.de
>

--

Max
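
The zero-variance case discussed in the thread can be checked directly. The following is a minimal sketch, not caret's code: `rsqBoth` is a hypothetical helper introduced here for illustration, and the constant predictions are made up (30.37 is simply the value seen in the printed summaries). It assumes only base R and the built-in `trees` data set.

```r
# Hypothetical helper for illustration; it is not part of caret.
rsqBoth <- function(obs, pred) {
  # Squared-correlation form: cor() returns NA (with a warning, suppressed
  # here) when sd(pred) is zero.
  rsqCor <- suppressWarnings(cor(obs, pred))^2

  # "Traditional" form, as in the customSummary quoted in the thread.
  ssTot <- sum((obs - mean(obs))^2)
  ssErr <- sum((obs - pred)^2)
  rsqTrad <- 1 - ssErr / ssTot

  c(cor = rsqCor, traditional = rsqTrad)
}

data(trees)
obs <- trees$Volume

# Constant predictions equal to the mean of the observations.
atMean <- rsqBoth(obs, rep(mean(obs), length(obs)))

# Constant predictions at another value (30.37 is taken from the thread's output).
offMean <- rsqBoth(obs, rep(30.37, length(obs)))

print(atMean)
print(offMean)
```

In this sketch the squared-correlation value is NA for both sets of constant predictions, while the traditional value is 0 when the constant equals the mean of the observations and negative otherwise. The traditional form divides by zero only when the observations themselves are constant, which is why the two formulations behave differently on resamples where rpart finds no split.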