Hi Stephane,

According to the NEWS file, as of 2.11.0: "cor() and cov() now test for misuse with non-numeric arguments, such as the non-bug report PR#14207", so there is no need for a new bug report.

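As a rough way to see that check in action, here is a minimal sketch. It assumes an R version of 2.11.0 or later; the toy data frame `dat.check` is made up for illustration and is not your data.

```r
# Assumption: R >= 2.11.0, where cor() validates its arguments.
# dat.check is a made-up example with one factor column.
dat.check <- data.frame(x = c(1, 2, 3, 4),
                        y = c(2, 1, 4, 3),
                        g = factor(c("a", "b", "c", "d")))

# Capture the condition instead of stopping the script
res.check <- tryCatch(cor(dat.check, method = "spearman"),
                      error = function(e) e)

if (inherits(res.check, "error")) {
  message("cor() refused the input: ", conditionMessage(res.check))
}
```

On an older version such as 2.10.1 the same call would presumably return a matrix instead of an error, which matches what you observed.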
Here is a simple way to select only numeric columns:

# Sample data
dat <- data.frame(a = 1:10, b = runif(10), c = paste(1:10),
                  d = rep(TRUE, 10), e = factor(rep("a", 10)),
                  stringsAsFactors = FALSE)

# (this includes numeric and integer, btw)
# drop = FALSE keeps a data frame even if only one column matches
dat[, sapply(dat, is.numeric), drop = FALSE]

# if you wanted to include logicals (which cor() will work with)
class.test <- function(x) {
  # TRUE for numeric (including integer) or logical vectors
  is.numeric(x) || is.logical(x)
}

# Columns that are numeric or logical
dat[, sapply(dat, class.test), drop = FALSE]

HTH,

Josh

On Thu, Sep 9, 2010 at 10:53 AM, Stephane Vaucher
<vauch...@iro.umontreal.ca> wrote:
> Hi Josh,
>
> Initially, I was expecting R to simply ignore non-numeric data. I guess I
> was wrong... I copy-pasted what I observe, and I do not get an error when
> calculating correlations with text data. I can also do cor(test.n$P3,
> test$P7) without an error.
>
> If you have a function to select only numeric columns that you can share
> with me (and the list), that would be great. Of course, I'm wondering why
> your version of R produces different results from mine. I don't know if I
> should open a bug report. It would be good if someone (other than me)
> observed this problem in their environment.
>
> Here is what I am currently using:
>
> R version 2.10.1 (2009-12-14)
> x86_64-pc-linux-gnu
>
> locale:
>  [1] LC_CTYPE=en_CA.UTF-8       LC_NUMERIC=C
>  [3] LC_TIME=en_CA.UTF-8        LC_COLLATE=en_CA.UTF-8
>  [5] LC_MONETARY=C              LC_MESSAGES=en_CA.UTF-8
>  [7] LC_PAPER=en_CA.UTF-8       LC_NAME=C
>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> The behaviour has been observed on:
>> sessionInfo()
>
> Version 2.3.1 (2006-06-01)
> x86_64-redhat-linux-gnu
>
> attached base packages:
> [1] "methods"   "stats"     "graphics"  "grDevices" "utils"     "datasets"
> [7] "base"
>
> As well as on a 32 bit linux arch v2.9.0.
>
> Sincere regards,
> sv
>
> On Thu, 9 Sep 2010, Joshua Wiley wrote:
>
>> Hi Stephane,
>>
>> When I use your sample data (e.g., test, test.number), cor() throws an
>> error that x must be numeric (because of the factor or character
>> data). Are you not getting any errors when trying to calculate the
>> correlation on these data? If you are not, I wonder what version of R
>> are you using? The quickest way to find out is sessionInfo().
>>
>> As far as a work around, it would be relatively simple to find out which
>> columns of your data frame were not numeric or integer and exclude
>> those (I'm happy to provide that code if you want).
>>
>> Best regards,
>>
>> Josh
>>
>> On Thu, Sep 9, 2010 at 7:50 AM, Stephane Vaucher
>> <vauch...@iro.umontreal.ca> wrote:
>>>
>>> Thank you Dennis,
>>>
>>> You identified a factor (text column) that I was concerned with. I
>>> simplified my example to try and factor out possible causes. I eliminated
>>> the recurring values in columns (which were not the columns that caused
>>> problems). I produced three examples with simple data sets.
>>>
>>> 1. Correct output, 2 columns only:
>>>
>>>> test.notext = read.csv('test-notext.csv')
>>>> cor(test.notext, method='spearman')
>>>
>>>                P3     HP_tot
>>> P3      1.0000000 -0.2182876
>>> HP_tot -0.2182876  1.0000000
>>>>
>>>> dput(test.notext)
>>>
>>> structure(list(P3 = c(2L, 2L, 2L, 4L, 2L, 3L, 2L, 1L, 3L, 2L,
>>> 2L, 2L, 3L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 2L, 2L, 2L, 1L, 2L),
>>>     HP_tot = c(10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 136L,
>>>     136L, 136L, 136L, 136L, 136L, 136L, 136L, 136L, 136L, 15L,
>>>     15L, 15L, 15L, 15L, 15L, 15L)), .Names = c("P3", "HP_tot"
>>> ), class = "data.frame", row.names = c(NA, -25L))
>>>
>>> 2.
>>> Incorrect output where I introduced my P7 column containing text only
>>> the 'a' character:
>>>
>>>> test = read.csv('test.csv')
>>>> cor(test, method='spearman')
>>>
>>>                P3 P7     HP_tot
>>> P3      1.0000000 NA -0.2502878
>>> P7             NA  1         NA
>>> HP_tot -0.2502878 NA  1.0000000
>>> Warning message:
>>> In cor(test, method = "spearman") : the standard deviation is zero
>>>>
>>>> dput(test)
>>>
>>> structure(list(P3 = c(2L, 2L, 2L, 4L, 2L, 3L, 2L, 1L, 3L, 2L,
>>> 2L, 2L, 3L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 2L, 2L, 2L, 1L, 2L),
>>>     P7 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
>>>     1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L
>>>     ), .Label = "a", class = "factor"), HP_tot = c(10L, 10L,
>>>     10L, 10L, 10L, 10L, 10L, 10L, 136L, 136L, 136L, 136L, 136L,
>>>     136L, 136L, 136L, 136L, 136L, 15L, 15L, 15L, 15L, 15L, 15L,
>>>     15L)), .Names = c("P3", "P7", "HP_tot"), class = "data.frame",
>>> row.names = c(NA, -25L))
>>>
>>> 3. Incorrect output with P7 containing a variety of alpha-numeric
>>> characters (ascii), to factor out equal valued column issue. Notice
>>> that the text column is interpreted as a numeric value.
>>>
>>>> test.number = read.csv('test-alpha.csv')
>>>> cor(test.number, method='spearman')
>>>
>>>                P3         P7     HP_tot
>>> P3      1.0000000  0.4093108 -0.2502878
>>> P7      0.4093108  1.0000000 -0.3807193
>>> HP_tot -0.2502878 -0.3807193  1.0000000
>>>>
>>>> dput(test.number)
>>>
>>> structure(list(P3 = c(2L, 2L, 2L, 4L, 2L, 3L, 2L, 1L, 3L, 2L,
>>> 2L, 2L, 3L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 2L, 2L, 2L, 1L, 2L),
>>>     P7 = structure(c(11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L,
>>>     19L, 20L, 21L, 22L, 23L, 24L, 25L, 1L, 2L, 3L, 4L, 5L, 6L,
>>>     7L, 8L, 9L, 10L), .Label = c("0", "1", "2", "3", "4", "5",
>>>     "6", "7", "8", "9", "a", "b", "c", "d", "e", "f", "g", "h",
>>>     "i", "j", "k", "l", "m", "n", "o"), class = "factor"), HP_tot = c(10L,
>>>     10L, 10L, 10L, 10L, 10L, 10L, 10L, 136L, 136L, 136L, 136L,
>>>     136L, 136L, 136L, 136L, 136L, 136L, 15L, 15L, 15L, 15L, 15L,
>>>     15L, 15L)), .Names = c("P3", "P7", "HP_tot"), class = "data.frame",
>>> row.names = c(NA, -25L))
>>>
>>> Correct output is obtained by avoiding matrix computation of correlation:
>>>>
>>>> cor(test.number$P3, test.number$HP_tot, method='spearman')
>>>
>>> [1] -0.2182876
>>>
>>> It seems that a text column corrupts my correlation calculation (only in
>>> a matrix calculation). I assumed that text columns would not influence the
>>> result of the calculations.
>>>
>>> Is this a correct behaviour? If not, I can submit a bug report? If it is,
>>> is there a known workaround?
>>>
>>> cheers,
>>> Stephane Vaucher
>>>
>>> On Thu, 9 Sep 2010, Dennis Murphy wrote:
>>>
>>>> Did you try taking out P7, which is text? Moreover, if you get a message
>>>> saying ' the standard deviation is zero', it means that the entire
>>>> column is constant. By definition, the covariance of a constant with a
>>>> random variable is 0, but your data consists of values, so cor()
>>>> understandably throws a warning that one or more of your columns are
>>>> constant.
>>>> Applying the following to your data (which I named expd instead), we get
>>>>
>>>> sapply(expd[, -12], var)
>>>>           P1           P2           P3           P4           P5           P6
>>>> 5.433333e-01 1.083333e+00 5.766667e-01 1.083333e+00 6.433333e-01 5.566667e-01
>>>>           P8           P9          P10          P11          P12         SITE
>>>> 5.733333e-01 3.193333e+00 5.066667e-01 2.500000e-01 5.500000e+00 2.493333e+00
>>>>       Errors     warnings       Manual        Total        H_tot        HP1.1
>>>> 9.072840e+03 2.081334e+04 7.433333e-01 3.823500e+04 3.880250e+03 2.676667e+00
>>>>        HP1.2        HP1.3        HP1.4       HP_tot        HO1.1        HO1.2
>>>> 0.000000e+00 2.008440e+03 3.057067e+02 3.827250e+03 8.400000e-01 0.000000e+00
>>>>        HO1.3        HO1.4       HO_tot        HU1.1        HU1.2        HU1.3
>>>> 0.000000e+00 0.000000e+00 8.400000e-01 0.000000e+00 2.100000e-01 2.266667e-01
>>>>       HU_tot           HR        L_tot        LP1.1        LP1.2        LP1.3
>>>> 6.233333e-01 7.433333e-01 3.754610e+03 3.209333e+01 0.000000e+00 2.065010e+03
>>>>        LP1.4       LP_tot        LO1.1        LO1.2        LO1.3        LO1.4
>>>> 2.246233e+02 3.590040e+03 3.684000e+01 0.000000e+00 0.000000e+00 2.840000e+00
>>>>       LO_tot        LU1.1        LU1.2        LU1.3       LU_tot       LR_tot
>>>> 6.000000e+01 0.000000e+00 1.440000e+00 3.626667e+00 8.373333e+00 4.943333e+00
>>>>       SP_tot        SP1.1        SP1.2        SP1.3        SP1.4     SP_tot.1
>>>> 6.911067e+02 4.225000e+01 0.000000e+00 1.009600e+02 4.161600e+02 3.071600e+02
>>>>        SO1.1        SO1.2        SO1.3        SO1.4       SO_tot        SU1.1
>>>> 4.543333e+00 2.500000e-01 0.000000e+00 2.100000e-01 5.250000e+00 0.000000e+00
>>>>        SU1.2        SU1.3       SU_tot           SR
>>>> 1.556667e+00 4.225000e+01 3.504000e+01 4.225000e+01
>>>>
>>>> Which columns are constant?
>>>> which(sapply(expd[, -12], var) < .Machine$double.eps)
>>>> HP1.2 HO1.2 HO1.3 HO1.4 HU1.1 LP1.2 LO1.2 LO1.3 LU1.1 SP1.2 SO1.3 SU1.1
>>>>    19    24    25    26    28    35    40    41    44    51    57    60
>>>>
>>>> I suspect that in your real data set, there aren't so many constant
>>>> columns, but this is one way to check.
>>>>
>>>> HTH,
>>>> Dennis
>>>>
>>>> On Wed, Sep 8, 2010 at 12:35 PM, Stephane Vaucher
>>>> <vauch...@iro.umontreal.ca> wrote:
>>>>
>>>>> Hi everyone,
>>>>>
>>>>> I'm observing what I believe is weird behaviour when attempting to do
>>>>> something very simple. I want a correlation matrix, but my matrix seems
>>>>> to contain correlation values that are not found when executed on pairs:
>>>>>
>>>>>> test2$P2
>>>>>  [1] 2 2 4 4 1 3 2 4 3 3 2 3 4 1 2 2 4 3 4 1 2 3 2 1 3
>>>>>> test2$HP_tot
>>>>>  [1]  10  10  10  10  10  10  10  10 136 136 136 136 136 136 136 136 136 136  15
>>>>> [20]  15  15  15  15  15  15
>>>>>> c=cor(test2$P3,test2$HP_tot,method='spearman')
>>>>>> c
>>>>> [1] -0.2182876
>>>>>> c=cor(test2,method='spearman')
>>>>> Warning message:
>>>>> In cor(test2, method = "spearman") : the standard deviation is zero
>>>>>> write(c,file='out.csv')
>>>>>
>>>>> from my spreadsheet
>>>>> -0.25028783918741
>>>>>
>>>>> Most cells are correct, but not that one.
>>>>>
>>>>> If this is expected behaviour, I apologise for bothering you, I read
>>>>> the documentation, but I do not know if the calculation of matrices and
>>>>> pairs is done using the same function (eg, with respect to equal value
>>>>> observations).
>>>>>
>>>>> If this is not a desired behaviour, I noticed that it only occurs with
>>>>> a relatively large matrix (I couldn't reproduce on a simple 2 column data
>>>>> set). There might be a naming error.
>>>>>
>>>>>> names(test2)
>>>>>  [1] "ID"                   "NOMBRE"              "MAIL"
>>>>>  [4] "Age"                  "SEXO"                "Studies"
>>>>>  [7] "Hours_Internet"       "Vision.Disabilities" "Other.disabilities"
>>>>> [10] "Technology_Knowledge" "Start_Time"          "End_Time"
>>>>> [13] "Duration"             "P1"                  "P1Book"
>>>>> [16] "P1DVD"                "P2"                  "P3"
>>>>> [19] "P4"                   "P5"                  "P6"
>>>>> [22] "P8"                   "P9"                  "P10"
>>>>> [25] "P11"                  "P12"                 "P7"
>>>>> [28] "SITE"                 "Errors"              "warnings"
>>>>> [31] "Manual"               "Total"               "H_tot"
>>>>> [34] "HP1.1"                "HP1.2"               "HP1.3"
>>>>> [37] "HP1.4"                "HP_tot"              "HO1.1"
>>>>> [40] "HO1.2"                "HO1.3"               "HO1.4"
>>>>> [43] "HO_tot"               "HU1.1"               "HU1.2"
>>>>> [46] "HU1.3"                "HU_tot"              "HR"
>>>>> [49] "L_tot"                "LP1.1"               "LP1.2"
>>>>> [52] "LP1.3"                "LP1.4"               "LP_tot"
>>>>> [55] "LO1.1"                "LO1.2"               "LO1.3"
>>>>> [58] "LO1.4"                "LO_tot"              "LU1.1"
>>>>> [61] "LU1.2"                "LU1.3"               "LU_tot"
>>>>> [64] "LR_tot"               "SP_tot"              "SP1.1"
>>>>> [67] "SP1.2"                "SP1.3"               "SP1.4"
>>>>> [70] "SP_tot.1"             "SO1.1"               "SO1.2"
>>>>> [73] "SO1.3"                "SO1.4"               "SO_tot"
>>>>> [76] "SU1.1"                "SU1.2"               "SU1.3"
>>>>> [79] "SU_tot"               "SR"
>>>>>
>>>>> Thank you in advance,
>>>>> Stephane Vaucher
>>>>>
>>>>> ______________________________________________
>>>>> R-help@r-project.org mailing list
>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>> PLEASE do read the posting guide
>>>>> http://www.R-project.org/posting-guide.html
>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>>
>>>>
>>>
>>> ______________________________________________
>>> R-help@r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>
>>
>>
>

--
Joshua Wiley
Ph.D.
Student, Health Psychology
University of California, Los Angeles
http://www.joshuawiley.com/
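
For anyone finding this thread later, here is a minimal sketch of the workaround wrapped as a function. The name `cor_numeric` and its `include.logical` argument are hypothetical (nothing by that name exists in base R), and the data are rebuilt from the dput(test.number) output quoted in the thread. It also prints the integer codes stored behind the factor P7, which is presumably what older versions of cor() ended up ranking when the text column appeared to be "interpreted as a numeric value".

```r
# Hypothetical helper (not in base R): keep numeric columns, optionally
# logical ones, report what was dropped, then call cor().
cor_numeric <- function(df, ..., include.logical = FALSE) {
  keep <- vapply(
    df,
    function(x) is.numeric(x) || (include.logical && is.logical(x)),
    logical(1)
  )
  dropped <- names(df)[!keep]
  if (length(dropped) > 0) {
    message("Dropping non-numeric columns: ", paste(dropped, collapse = ", "))
  }
  # drop = FALSE keeps a data frame even if only one column survives
  cor(df[, keep, drop = FALSE], ...)
}

# Data rebuilt from the dput(test.number) output in the thread
test.number <- data.frame(
  P3 = c(2L, 2L, 2L, 4L, 2L, 3L, 2L, 1L, 3L, 2L, 2L, 2L, 3L,
         1L, 2L, 1L, 1L, 1L, 2L, 1L, 2L, 2L, 2L, 1L, 2L),
  P7 = factor(c(letters[1:15], 0:9)),
  HP_tot = rep(c(10L, 136L, 15L), times = c(8, 10, 7))
)

# A factor is stored as integer codes indexing its sorted levels
codes <- as.integer(test.number$P7)
print(codes)

# Correlation on the numeric columns only (P7 is dropped with a message)
res <- cor_numeric(test.number, method = "spearman")
print(res)
```

With P7 excluded, the P3/HP_tot entry should agree with the pairwise value reported in the thread (-0.2182876). Passing `include.logical = TRUE` would also keep logical columns, mirroring the class.test() variant.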