Re: [R] Correlation question

Stephane Vaucher Thu, 09 Sep 2010 10:55:43 -0700

Hi Josh,

Initially, I was expecting R to simply ignore non-numeric data. I guess I was wrong... I copy-pasted what I observe, and I do not get an error when calculating correlations with text data. I can also do cor(test.n$P3,test$P7) without an error.

If you have a function to select only numeric columns that you can share with me (and the list), that would be great. Of course, I'm wondering why your version of R produces different results from mine. I don't know if I should open a bug report. It would be good if someone (other than me) observed this problem in their environment.


Here is what I am currently using:

R version 2.10.1 (2009-12-14)
x86_64-pc-linux-gnu

locale:
 [1] LC_CTYPE=en_CA.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_CA.UTF-8        LC_COLLATE=en_CA.UTF-8
 [5] LC_MONETARY=C              LC_MESSAGES=en_CA.UTF-8
 [7] LC_PAPER=en_CA.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

The behaviour has been observed on:

sessionInfo()

Version 2.3.1 (2006-06-01)
x86_64-redhat-linux-gnu

attached base packages:
[1] "methods"   "stats"     "graphics"  "grDevices" "utils"     "datasets"
[7] "base"

As well as on a 32 bit linux arch v2.9.0.

Sincere regards,
sv

On Thu, 9 Sep 2010, Joshua Wiley wrote:

Hi Stephane,

When I use your sample data (e.g., test, test.number), cor() throws an
error that x must be numeric (because of the factor or character
data).  Are you not getting any errors when trying to calculate the
correlation on these data?  If you are not, I wonder what version of R
are you using?  The quickest way to find out is sessionInfo().

As far as a workaround goes, it would be relatively simple to find out
which columns of your data frame are not numeric or integer and exclude
those (I'm happy to provide that code if you want).
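
A minimal sketch of such a filter, assuming the data frame mixes integer, numeric and factor columns as in the examples quoted below. The helper name numeric_only and the small demo data frame are hypothetical and are not taken from this thread.

```r
# Hypothetical helper: keep only the columns for which is.numeric() is TRUE.
# is.numeric() returns FALSE for factors and character vectors.
numeric_only <- function(df) {
  keep <- vapply(df, is.numeric, logical(1))
  df[, keep, drop = FALSE]
}

# Small made-up data frame with one factor column, for illustration only.
demo <- data.frame(P3 = c(2L, 4L, 1L, 3L),
                   P7 = factor(c("a", "b", "c", "d")),
                   HP_tot = c(10L, 136L, 15L, 10L))

# The correlation matrix is then computed on P3 and HP_tot only.
cor(numeric_only(demo), method = "spearman")
```

drop = FALSE keeps the result a data frame even when only one numeric column remains.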

Best regards,

Josh

On Thu, Sep 9, 2010 at 7:50 AM, Stephane Vaucher
<vauch...@iro.umontreal.ca> wrote:

Thank you Dennis,

You identified a factor (text column) that I was concerned with. I
simplified my example to try and factor out possible causes. I eliminated
the recurring values in columns (which were not the columns that caused
problems). I produced three examples with simple data sets.

1. Correct output, 2 columns only:

test.notext = read.csv('test-notext.csv')
cor(test.notext, method='spearman')


              P3     HP_tot
P3      1.0000000 -0.2182876
HP_tot -0.2182876  1.0000000


dput(test.notext)


structure(list(P3 = c(2L, 2L, 2L, 4L, 2L, 3L, 2L, 1L, 3L, 2L,
2L, 2L, 3L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 2L, 2L, 2L, 1L, 2L),
   HP_tot = c(10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 136L,
   136L, 136L, 136L, 136L, 136L, 136L, 136L, 136L, 136L, 15L,
   15L, 15L, 15L, 15L, 15L, 15L)), .Names = c("P3", "HP_tot"
), class = "data.frame", row.names = c(NA, -25L))

2. Incorrect output after I introduced my P7 column, which contains text
(only the 'a' character):

test = read.csv('test.csv')
cor(test, method='spearman')


              P3 P7     HP_tot
P3      1.0000000 NA -0.2502878
P7             NA  1         NA
HP_tot -0.2502878 NA  1.0000000
Warning message:
In cor(test, method = "spearman") : the standard deviation is zero


dput(test)


structure(list(P3 = c(2L, 2L, 2L, 4L, 2L, 3L, 2L, 1L, 3L, 2L,
2L, 2L, 3L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 2L, 2L, 2L, 1L, 2L),
   P7 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
   1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L
   ), .Label = "a", class = "factor"), HP_tot = c(10L, 10L,
   10L, 10L, 10L, 10L, 10L, 10L, 136L, 136L, 136L, 136L, 136L,
   136L, 136L, 136L, 136L, 136L, 15L, 15L, 15L, 15L, 15L, 15L,
   15L)), .Names = c("P3", "P7", "HP_tot"), class = "data.frame",
row.names = c(NA, -25L))

3. Incorrect output with P7 containing a variety of alpha-numeric characters
(ascii), to factor out equal valued column issue. Notice that the text
column is interpreted as a numeric value.

test.number = read.csv('test-alpha.csv')
cor(test.number, method='spearman')


              P3         P7     HP_tot
P3      1.0000000  0.4093108 -0.2502878
P7      0.4093108  1.0000000 -0.3807193
HP_tot -0.2502878 -0.3807193  1.0000000


dput(test.number)


structure(list(P3 = c(2L, 2L, 2L, 4L, 2L, 3L, 2L, 1L, 3L, 2L,
2L, 2L, 3L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 2L, 2L, 2L, 1L, 2L),
   P7 = structure(c(11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L,
   19L, 20L, 21L, 22L, 23L, 24L, 25L, 1L, 2L, 3L, 4L, 5L, 6L,
   7L, 8L, 9L, 10L), .Label = c("0", "1", "2", "3", "4", "5",
   "6", "7", "8", "9", "a", "b", "c", "d", "e", "f", "g", "h",
   "i", "j", "k", "l", "m", "n", "o"), class = "factor"), HP_tot = c(10L,
   10L, 10L, 10L, 10L, 10L, 10L, 10L, 136L, 136L, 136L, 136L,
   136L, 136L, 136L, 136L, 136L, 136L, 15L, 15L, 15L, 15L, 15L,
   15L, 15L)), .Names = c("P3", "P7", "HP_tot"), class = "data.frame",
row.names = c(NA, -25L))
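
As a side note on why a text column can end up looking numeric: a factor is stored as integer codes that index its sorted levels, so any coercion to integer returns those codes rather than anything derived from the text. A small sketch, using a made-up four-value factor rather than the P7 column above:

```r
# Made-up factor mixing letters and digits, for illustration only.
p7 <- factor(c("a", "b", "0", "1"))

levels(p7)      # the levels are sorted, with the digits ahead of the letters
as.integer(p7)  # the stored codes, i.e. positions within levels(p7)
```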

Correct output is obtained by avoiding matrix computation of correlation:


cor(test.number$P3, test.number$HP_tot, method='spearman')


[1] -0.2182876

It seems that a text column corrupts my correlation calculation (only in a
matrix calculation). I assumed that text columns would not influence the
result of the calculations.
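
A possible explanation for the discrepancy, offered as a hypothesis that nobody in this thread confirms: as.matrix() on a data frame that contains a factor returns a character matrix, so if cor() ranks that matrix for Spearman's method, HP_tot is ranked as strings, and "136" sorts before "15". The sketch below rebuilds P3 and HP_tot from the dput() output above and compares the two rankings. The names r_numeric and r_string are mine.

```r
# P3 and HP_tot as given in the dput() output above.
P3 <- c(2L, 2L, 2L, 4L, 2L, 3L, 2L, 1L, 3L, 2L,
        2L, 2L, 3L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 2L, 2L, 2L, 1L, 2L)
HP_tot <- c(rep(10L, 8), rep(136L, 10), rep(15L, 7))

# Spearman's rho is the Pearson correlation of the ranks.
# Numeric ranking orders the values as 10 < 15 < 136.
r_numeric <- cor(rank(P3), rank(HP_tot))

# String ranking orders them as "10" < "136" < "15".
r_string <- cor(rank(P3), rank(as.character(HP_tot)))

r_numeric
r_string
```

On these data r_numeric reproduces the pairwise result (-0.2182876) and r_string reproduces the matrix cell (-0.2502878), which is at least consistent with the hypothesis.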

Is this correct behaviour? If not, should I submit a bug report? If it is,
is there a known workaround?

cheers,
Stephane Vaucher

On Thu, 9 Sep 2010, Dennis Murphy wrote:

Did you try taking out P7, which is text? Moreover, if you get a message
saying 'the standard deviation is zero', it means that the entire column is
constant. By definition, the covariance of a constant with a random variable
is 0, but your data consists of values, so cor() understandably throws a
warning that one or more of your columns are constant. Applying the
following to your data (which I named expd instead), we get

sapply(expd[, -12], var)
          P1           P2           P3           P4           P5           P6
5.433333e-01 1.083333e+00 5.766667e-01 1.083333e+00 6.433333e-01 5.566667e-01
          P8           P9          P10          P11          P12         SITE
5.733333e-01 3.193333e+00 5.066667e-01 2.500000e-01 5.500000e+00 2.493333e+00
      Errors     warnings       Manual        Total        H_tot        HP1.1
9.072840e+03 2.081334e+04 7.433333e-01 3.823500e+04 3.880250e+03 2.676667e+00
       HP1.2        HP1.3        HP1.4       HP_tot        HO1.1        HO1.2
0.000000e+00 2.008440e+03 3.057067e+02 3.827250e+03 8.400000e-01 0.000000e+00
       HO1.3        HO1.4       HO_tot        HU1.1        HU1.2        HU1.3
0.000000e+00 0.000000e+00 8.400000e-01 0.000000e+00 2.100000e-01 2.266667e-01
      HU_tot           HR        L_tot        LP1.1        LP1.2        LP1.3
6.233333e-01 7.433333e-01 3.754610e+03 3.209333e+01 0.000000e+00 2.065010e+03
       LP1.4       LP_tot        LO1.1        LO1.2        LO1.3        LO1.4
2.246233e+02 3.590040e+03 3.684000e+01 0.000000e+00 0.000000e+00 2.840000e+00
      LO_tot        LU1.1        LU1.2        LU1.3       LU_tot       LR_tot
6.000000e+01 0.000000e+00 1.440000e+00 3.626667e+00 8.373333e+00 4.943333e+00
      SP_tot        SP1.1        SP1.2        SP1.3        SP1.4     SP_tot.1
6.911067e+02 4.225000e+01 0.000000e+00 1.009600e+02 4.161600e+02 3.071600e+02
       SO1.1        SO1.2        SO1.3        SO1.4       SO_tot        SU1.1
4.543333e+00 2.500000e-01 0.000000e+00 2.100000e-01 5.250000e+00 0.000000e+00
       SU1.2        SU1.3       SU_tot           SR
1.556667e+00 4.225000e+01 3.504000e+01 4.225000e+01

Which columns are constant?
which(sapply(expd[, -12], var) < .Machine$double.eps)
HP1.2 HO1.2 HO1.3 HO1.4 HU1.1 LP1.2 LO1.2 LO1.3 LU1.1 SP1.2 SO1.3 SU1.1
 19    24    25    26    28    35    40    41    44    51    57    60

I suspect that in your real data set, there aren't so many constant columns,
but this is one way to check.
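
The same check can be turned into a filter that drops the constant columns before calling cor(). This is only a sketch: the helper name drop_constant and the demo data frame are hypothetical, and it assumes every column is already numeric.

```r
# Hypothetical helper: drop columns whose variance is (numerically) zero.
# Columns whose variance is NA (e.g. because of missing values) are dropped too.
drop_constant <- function(df) {
  v <- vapply(df, var, numeric(1))
  df[, !is.na(v) & v > .Machine$double.eps, drop = FALSE]
}

# Made-up data: column b is constant and is removed.
demo_const <- data.frame(a = c(1, 2, 3, 4),
                         b = c(5, 5, 5, 5),
                         c = c(2, 1, 4, 3))
cor(drop_constant(demo_const), method = "spearman")
```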

HTH,
Dennis

On Wed, Sep 8, 2010 at 12:35 PM, Stephane Vaucher
<vauch...@iro.umontreal.ca> wrote:

Hi everyone,

I'm observing what I believe is weird behaviour when attempting to do
something very simple. I want a correlation matrix, but my matrix seems to
contain correlation values that are not found when executed on pairs:

 test2$P2

 [1] 2 2 4 4 1 3 2 4 3 3 2 3 4 1 2 2 4 3 4 1 2 3 2 1 3

test2$HP_tot

 [1]  10  10  10  10  10  10  10  10 136 136 136 136 136 136 136 136 136 136  15
[20]  15  15  15  15  15  15
c=cor(test2$P3,test2$HP_tot,method='spearman')

[1] -0.2182876

c=cor(test2,method='spearman')

Warning message:
In cor(test2, method = "spearman") : the standard deviation is zero

write(c,file='out.csv')


from my spreadsheet
-0.25028783918741

Most cells are correct, but not that one.

If this is expected behaviour, I apologise for bothering you. I read the
documentation, but I do not know if the calculation of matrices and pairs is
done using the same function (e.g., with respect to equal-valued
observations).

If this is not a desired behaviour, I noticed that it only occurs with a
relatively large matrix (I couldn't reproduce on a simple 2 column data
set). There might be a naming error.

 names(test2)

 [1] "ID"                   "NOMBRE"               "MAIL"
 [4] "Age"                  "SEXO"                 "Studies"
 [7] "Hours_Internet"       "Vision.Disabilities"  "Other.disabilities"
[10] "Technology_Knowledge" "Start_Time"           "End_Time"
[13] "Duration"             "P1"                   "P1Book"
[16] "P1DVD"                "P2"                   "P3"
[19] "P4"                   "P5"                   "P6"
[22] "P8"                   "P9"                   "P10"
[25] "P11"                  "P12"                  "P7"
[28] "SITE"                 "Errors"               "warnings"
[31] "Manual"               "Total"                "H_tot"
[34] "HP1.1"                "HP1.2"                "HP1.3"
[37] "HP1.4"                "HP_tot"               "HO1.1"
[40] "HO1.2"                "HO1.3"                "HO1.4"
[43] "HO_tot"               "HU1.1"                "HU1.2"
[46] "HU1.3"                "HU_tot"               "HR"
[49] "L_tot"                "LP1.1"                "LP1.2"
[52] "LP1.3"                "LP1.4"                "LP_tot"
[55] "LO1.1"                "LO1.2"                "LO1.3"
[58] "LO1.4"                "LO_tot"               "LU1.1"
[61] "LU1.2"                "LU1.3"                "LU_tot"
[64] "LR_tot"               "SP_tot"               "SP1.1"
[67] "SP1.2"                "SP1.3"                "SP1.4"
[70] "SP_tot.1"             "SO1.1"                "SO1.2"
[73] "SO1.3"                "SO1.4"                "SO_tot"
[76] "SU1.1"                "SU1.2"                "SU1.3"
[79] "SU_tot"               "SR"

Thank you in advance,
Stephane Vaucher

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

