Hello R users, I'm writing a brief tutorial of getting statistical measures by splitting according strata and over columns. When I used plyr::ddply I got and unexpected result, with NA/NaN for non existing cells. Below is a minimal reproducible code with the result that I got. For comparison, the result of aggregate is showed. Why this behaviour? What I can do to avoid it?
> require(plyr) > > hab <- + read.table("http://www.leg.ufpr.br/~walmes/data/ipea_habitacao.csv", + header=TRUE, sep=",", stringsAsFactors=FALSE, quote="", + encoding="utf-8") > > hab <- hab[,-ncol(hab)] > names(hab) <- c("sig", "cod", "mun", "agua", "ener", "tel", "carro", + "comp", "tot") > hab <- transform(hab, sig=factor(sig)) > hab$siz <- cut(hab$tot, breaks=c(-Inf, 5000, Inf), + labels=c("P","G")) > str(hab, ve.len=1) 'data.frame': 5596 obs. of 10 variables: $ sig : Factor w/ 27 levels "AC","AL","AM",..: 1 1 1 1 1 1 1 1 1 1 ... $ cod : int 1200013 1200054 1200104 1200138 1200179 1200203 1200252 1200302 1200328 1200336 ... $ mun : chr "Acrelândia" "Assis Brasil" "Brasiléia" "Bujari" ... $ agua : num 21.5 27.4 26.9 17.3 13.1 ... $ ener : num 56.2 65.3 55.9 43.9 35.9 ... $ tel : num 8.85 26.71 22.73 12.28 9.19 ... $ carro: num 9.3 6.03 7.47 6.49 5.73 ... $ comp : num 0.947 1.637 1.857 0.127 0.088 ... $ tot : int 1878 807 4114 1365 1267 14368 2807 5268 740 2308 ... $ siz : Factor w/ 2 levels "P","G": 1 1 1 1 1 2 1 2 1 1 ... > > xtabs(~siz+sig, hab) sig siz AC AL AM AP BA CE DF ES GO MA MG MS MT PA PB PE PI PR RJ P 17 72 47 13 266 103 0 43 187 152 671 52 100 80 195 97 199 310 27 G 5 29 15 3 149 81 1 34 55 65 182 25 26 63 28 88 22 89 64 sig siz RN RO RR RS SC SE SP TO P 147 34 14 355 236 56 396 129 G 19 18 1 112 57 19 249 10 > > x <- ddply(hab, ~sig+siz, + colwise(.fun=mean, + .cols=~agua+ener+tel+carro+comp, na.rm=TRUE)) > head(x) sig siz agua ener tel carro comp 1 AC P 19.30229 51.30535 12.857118 5.395824 0.7028235 2 AC G 28.39300 65.95740 26.322800 8.942000 2.3806000 3 AL P 42.14337 81.20935 4.801500 7.069958 0.5332639 4 AL G 54.03966 87.47834 9.771724 9.428586 1.3583793 5 *AL <NA> NaN NaN NaN NaN NaN* 6 AM P 25.66202 61.12709 5.749596 1.980362 0.7629362 > > x <- ddply(hab, ~sig+siz, + colwise(.fun=sum, + .cols=~agua+ener+tel+carro+comp, na.rm=TRUE)) > head(x) sig siz agua ener tel carro comp 1 AC P 328.139 872.191 218.571 91.729 11.948 2 AC G 141.965 329.787 131.614 44.710 11.903 3 AL P 3034.323 5847.073 345.708 509.037 38.395 4 AL G 1567.150 2536.872 283.380 273.429 39.393 5 *AL <NA> 0.000 0.000 0.000 0.000 0.000* 6 AM P 1206.115 2872.973 270.231 93.077 35.858 > > y <- aggregate(as.matrix(hab[,4:8])~sig+siz, data=hab, FUN=mean, + na.rm=TRUE) > head(y) sig siz agua ener tel carro comp 1 AC P 19.30229 51.30535 12.857118 5.395824 0.7028235 2 AL P 42.14337 81.20935 4.801500 7.069958 0.5332639 3 AM P 25.66202 61.12709 5.749596 1.980362 0.7629362 4 AP P 37.20362 82.04515 14.929154 5.775923 0.7922308 5 BA P 41.23200 66.45516 4.524045 10.387203 0.8404624 6 CE P 36.78599 78.20176 6.339990 7.768981 0.6446990 > > sessionInfo() R version 3.1.1 (2014-07-10) Platform: x86_64-pc-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=pt_BR.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=pt_BR.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=pt_BR.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=pt_BR.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] plyr_1.8.1 loaded via a namespace (and not attached): [1] compiler_3.1.1 Rcpp_0.11.2 tools_3.1.1 > Thanks in advance. Walmes. ========================================================================== Walmes Marques Zeviani LEG (Laboratório de Estatística e Geoinformação, 25.450418 S, 49.231759 W) Departamento de Estatística - Universidade Federal do Paraná fone: (+55) 41 3361 3573 skype: walmeszeviani homepage: http://www.leg.ufpr.br/~walmes linux user number: 531218 ========================================================================== [[alternative HTML version deleted]] ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.