Dear Tom, I'm afraid that you're confusing the point-biserial correlation with the biserial correlation. It's the latter that polyserial() in the polycor package computes; the biserial correlation is a special case when the (ordered) categorical variable has just two categories.
Generally, biserial correlations will be larger than point-serial correlations, since the former estimates the correlation in an underlying bivariate normal distribution that has been binned. I haven't looked closely at your example, but suspect that the difference in sign is just due to the arbitrary specification of which category is higher than the other. As for a reference, ?polyserial gives one. I hope this helps, John -------------------------------- John Fox, Professor Department of Sociology McMaster University Hamilton, Ontario Canada L8S 4M4 905-525-9140x23604 http://socserv.mcmaster.ca/jfox -------------------------------- > -----Original Message----- > From: [EMAIL PROTECTED] > [mailto:[EMAIL PROTECTED] On Behalf Of Tom Willems > Sent: Monday, October 29, 2007 11:01 AM > To: r-help@r-project.org > Subject: [R] biserial correlation with pkg polycor > > Dear R-ussers, > > While looking for a way to calculate the association between > a countinuous and a binary variable, i found a procedure > called point biserial corralation. > Me, not being a mathematicion, i did my very best to > understand what it was all about, and then i found a easily > understandable paper (by steve > simon) on ow to calculate this. ref ## > http://www.childrens-mercy.org/stats/definitions/biserial.htm > (this page has the same example) Further i discovered the > polycor package in R. > Now i'm having troubles with the fact that the polycor pkg > never gives me the same output as the manuals aplication of > the formula. > > In the example below found, manualy r(biserial) = 0.49 > between fb an age, and ussing function polyserial (polycor > pkg) r(biserial) =-0.8591. > This is a rather big difference, no due to abriviation or > flootingpoints. > > Is there someone whom is familiar with biserial correlation, > and the appropriate way to calculate it? > > Kind regards, > Tom. > > here is the example, at the end is the R file. > > 1e I create the input > > > library(abind) > > library (polycor) > > ### data input > > no <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4, 5, 6, 7, 8) > fb <- c(19, > > 30, 20, 19, 29, 25, 21, 24, 50, 25, 21, 17, 15, 14, 14, 22, > 17) > > ss <- c(14, 41, 18, 11, 16, 24, 18, 21, 37, 17, 10, 16, 22, 12, 14, > > 12, > 18) > > age <-c('elderly', 'elderly', 'elderly', 'elderly', 'elderly', > 'elderly', 'elderly', 'elderly', 'elderly', 'young', 'young', > 'young', 'young', 'young', 'young', 'young', 'young') > > > > dataset <- data.frame(no,fb,ss,age) > > dataset <- subset(dataset,select=c(fb:age)) > > nrow(dataset) > [1] 17 > > data_eld <- subset(dataset,age=='elderly',select=fb) > > data_young <- subset(dataset,age=='young',select=fb) > > > > here i calculate the R_bis (biserial corelation) manualy > > > ### point biserial correlation > > > > fb <- subset(dataset,select=fb) > > fb0 <- subset(dataset,age!='elderly',select=fb) > > fb1 <- subset(dataset,age=='elderly',select=fb) > > meanfb0 <- mean(fb0,na.rm=T) > > meanfb1 <- mean(fb1,na.rm=T) > > sdfb<- sd(dataset$fb,na.rm=T) > > > > ss <- subset(dataset, select=ss) > > ss0 <- subset(dataset,age!='elderly',select=ss) > > ss1 <- subset(dataset,age=='elderly',select=ss) > > meanss0 <- mean(ss0,na.rm=T) > > meanss1 <- mean(ss1,na.rm=T) > > sdss<- sd(dataset$ss,na.rm=T) > > age <- subset(dataset,select=age) > > n <- nrow(dataset) > > > this is the formula from ref ## > http://www.childrens-mercy.org/stats/definitions/biserial.htm > > > > R_bis <- function(x,x1,x0,n){ p <- (nrow(x1)/n) > >+ (((mean(x1,na.rm=T)-mean(x0,na.rm=T))/sd(x,na.rm=T)) > *sqrt(p*(1-p))) } > > this is the corrected formula from ref ## > http://en.wikipedia.org/wiki/Point-biserial_correlation_coefficient > >> R_bis2 <- function(x,x1,x0,n){ > + ((mean(x1,na.rm=T)-mean(x0,na.rm=T))/sd(x,na.rm=T)) > * (sqrt( > (nrow(x1)*nrow(x0))/(n*(n-1))))} > > > >> R_bis(fb,fb1,fb0,n) > > fb > >0.4798873 > > result in paper was 0.49 > > >> R_bis2(fb,fb1,fb0,n) > > fb > >0.4946565 > > equals result in paper 0.49 > > Then i use the polycor package, > function hetcor will give all the different correlation ressults > > > >> hetcor(dataset$fb,dataset$ss,dataset$age ,ML=TRUE) > > > >Maximum-Likelihood Estimates > > > >Correlations/Type of Correlation: > > dataset$fb dataset.ss dataset.age > >dataset$fb 1 Pearson Polyserial > >dataset.ss 0.703 1 Polyserial > >dataset.age -0.8591 -0.6685 1 > > > >Standard Errors: > > dataset$fb dataset.ss > >dataset$fb > >dataset.ss 0.1215 > >dataset.age 0.1106 0.2497 > > > >n = 17 > > > >P-values for Tests of Bivariate Normality: > > dataset$fb dataset.ss > >dataset$fb > >dataset.ss 0.1782 > >dataset.age 0.4269 0.4034 > > hetcor(dataset,ML=TRUE) > > > >Maximum-Likelihood Estimates > > > >Correlations/Type of Correlation: > > fb ss age > >fb 1 Pearson Polyserial > >ss 0.703 1 Polyserial > >age -0.8591 -0.6685 1 > > > >Standard Errors: > > fb ss > >fb > >ss 0.1215 > >age 0.1106 0.2497 > > > >n = 17 > > > >P-values for Tests of Bivariate Normality: > > fb ss > >fb > >ss 0.1782 > >age 0.4269 0.4034 > > here a quick two step method is ussed to calculate the polyserial > correlation > > > >polyserial(dataset$fb,dataset$age) > >[1] -0.6205737 > >> polyserial(dataset$fb,dataset$age, ML=TRUE, std.err=TRUE) > > same method as in hetcor, only for indecated variables > > >Polyserial Correlation, ML est. = -0.8591 (0.1106) > >Test of bivariate normality: Chisquare = 4.91, df = 5, p = 0.4269 > > > > 1 > >Threshold 0.1811 > >Std.Err. 0.1849 > > > > > > ### for side to side (ss) incase no 9 is an outlier in > fb, this will > not be the case in ss > > > >> R_bis(ss,ss1,ss0,n) > > ss > >0.4153681 > > result in paper was 0.43 > > >> R_bis2(ss,ss1,ss0,n) > > ss > >0.4281516 > > equals result in paper 0.43 > > >> polyserial(dataset$ss,dataset$age) > >[1] -0.5371397 > > >> polyserial(dataset$ss,dataset$age, ML=TRUE, std.err=TRUE) > > > >Polyserial Correlation, ML est. = -0.6685 (0.2497) > >Test of bivariate normality: Chisquare = 5.103, df = 5, p = 0.4034 > > > > 1 > >Threshold 0.1504 > >Std.Err. 0.2583 > ############################################################## > ################### > > ### R-file > ############################################################## > ################### > > library(abind) > library (polycor) > ### data input > no <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4, 5, 6, 7, 8) > fb <- c(19, 30, 20, 19, 29, 25, 21, 24, 50, 25, 21, 17, 15, > 14, 14, 22, > 17) > ss <- c(14, 41, 18, 11, 16, 24, 18, 21, 37, 17, 10, 16, 22, > 12, 14, 12, > 18) > age <-c('elderly', 'elderly', 'elderly', 'elderly', > 'elderly', 'elderly', > 'elderly', 'elderly', 'elderly', 'young', 'young', 'young', 'young', > 'young', 'young', 'young', 'young') > > dataset <- data.frame(no,fb,ss,age) > dataset <- subset(dataset,select=c(fb:age)) > nrow(dataset) > data_eld <- subset(dataset,age=='elderly',select=fb) > data_young <- subset(dataset,age=='young',select=fb) > > ### point biserial correlation > > fb <- subset(dataset,select=fb) > fb0 <- subset(dataset,age!='elderly',select=fb) > fb1 <- subset(dataset,age=='elderly',select=fb) > meanfb0 <- mean(fb0,na.rm=T) > meanfb1 <- mean(fb1,na.rm=T) > sdfb<- sd(dataset$fb,na.rm=T) > > ss <- subset(dataset, select=ss) > ss0 <- subset(dataset,age!='elderly',select=ss) > ss1 <- subset(dataset,age=='elderly',select=ss) > meanss0 <- mean(ss0,na.rm=T) > meanss1 <- mean(ss1,na.rm=T) > sdss<- sd(dataset$ss,na.rm=T) > age <- subset(dataset,select=age) > n <- nrow(dataset) > > R_bis <- function(x,x1,x0,n){ p <- (nrow(x1)/n) > (((mean(x1,na.rm=T)-mean(x0,na.rm=T))/sd(x,na.rm=T)) > *sqrt(p*(1-p))) } > R_bis2 <- function(x,x1,x0,n){ > ((mean(x1,na.rm=T)-mean(x0,na.rm=T))/sd(x,na.rm=T)) * (sqrt( > (nrow(x1)*nrow(x0))/(n*(n-1))))} > > R_bis(fb,fb1,fb0,n) > R_bis2(fb,fb1,fb0,n) > > > hetcor(dataset$fb,dataset$ss,dataset$age ,ML=TRUE) > hetcor(dataset,ML=TRUE) > > polyserial(dataset$fb,dataset$age) > polyserial(dataset$fb,dataset$age, ML=TRUE, std.err=TRUE) > > ### for side to side (ss) incase no 9 is an outlier in fb, > this will not > be the case in ss > > R_bis(ss,ss1,ss0,n) > R_bis2(ss,ss1,ss0,n) > > polyserial(dataset$ss,dataset$age) > polyserial(dataset$ss,dataset$age, ML=TRUE, std.err=TRUE) > > ## END > ############################################################## > ############ > > Willems Tom > > E-mail: [EMAIL PROTECTED] > > > > > Disclaimer: click here > [[alternative HTML version deleted]] > > ______________________________________________ > R-help@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide > http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. > ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.