Greetings. I was reading through the vignette for "tidy-data" (from the "tidyr" package) and came across something that puzzled me.
One of the examples in the vignette uses a data set related to tuberculosis, originally from the World Health Organization, but also available at:

https://github.com/hadley/tidy-data/blob/master/data/tb.csv

Here's the code:

+++++
> library(dplyr)  #### for tbl_df
> library(tidyr)  #### for gather
> tb <- tbl_df(read.csv("tb.csv", stringsAsFactors=FALSE))
> tb2 <- tb %>%
+   gather(demo, n, -iso2, -year, na.rm=TRUE)
> str(tb2)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame':  35750 obs. of  4 variables:
 $ iso2: chr  "AD" "AD" "AD" "AE" ...
 $ year: int  2005 2006 2008 2006 2007 2008 2007 2005 2006 2007 ...
 $ demo: Factor w/ 20 levels "m04","m514","m014",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ n   : int  0 0 0 0 0 0 0 0 1 0 ...
>
-----

I thought it might be interesting to see how to do this using the "reshape2" package. Here's the code for that:

+++++
library(reshape2)
tb2a <- tb %>%
  melt(
    id.vars=c("iso2", "year"),
    variable.name="demo",
    value.name="n",
    na.rm=TRUE)
tb2a <- tbl_df(tb2a)
> str(tb2a)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame':  35750 obs. of  4 variables:
 $ iso2: chr  "AD" "AD" "AD" "AE" ...
 $ year: int  2005 2006 2008 2006 2007 2008 2007 2005 2006 2007 ...
 $ demo: Factor w/ 20 levels "m04","m514","m014",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ n   : int  0 0 0 0 0 0 0 0 1 0 ...
>
-----

The "str" results make it appear that I'm on the right track, but it's always good to double check:

+++++
> all.equal(tb2, tb2a)
[1] "Rows in x but not y: 34659, 34658, 34656, 34655, 34651, 34650, 34649, 34648, 34647, 34646, 32264[...]Rows in y but not x: 35663, 34658, 34657, 34656, 34655, 34652, 34651, 34650, 34649, 32265, 32264[...]"
>
-----

Hmm. Not what I'd hoped for, but none of the simple visual checks I tried showed any differences.
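A column-by-column check is one way to go beyond eyeballing the output. Here's a minimal sketch on made-up toy data: the data frames `x` and `y` and the helper `compare_columns` are hypothetical stand-ins, not anything from the vignette. It relies only on base R's `identical`, which treats an NA in the same position as a match.

```r
# Toy stand-ins for tb2 and tb2a (hypothetical data, not the real tb.csv).
x <- data.frame(iso2 = c("AD", NA, "AE"),
                year = c(2005L, 1995L, 2006L),
                n    = c(0L, 0L, 1L),
                stringsAsFactors = FALSE)
y <- x

# Hypothetical helper: compare two data frames one column at a time.
# Returns a named logical vector, TRUE where the columns are identical.
compare_columns <- function(a, b) {
  stopifnot(identical(names(a), names(b)))
  vapply(names(a), function(nm) identical(a[[nm]], b[[nm]]), logical(1))
}

same <- compare_columns(x, y)
print(same)

# Introduce a real difference to confirm the helper notices it.
y$n[3] <- 2L
changed <- compare_columns(x, y)
print(changed)
```

On the real data this would be called as `compare_columns(tb2, tb2a)`; I'd expect it to sidestep whatever method `all.equal` dispatches to for these classes, since it only ever looks at the underlying column vectors.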
After a little trial and error, I found the place where the results differ:

+++++
> ROWS <- 2356
> all.equal(tb2[1:ROWS, ], tb2a[1:ROWS, ])
[1] TRUE
> ROWS <- 2357
> all.equal(tb2[1:ROWS, ], tb2a[1:ROWS, ])
[1] "Rows in x but not y: 2357Rows in y but not x: 2357"
-----

OK, let's have a look at the spot where things go off the rails:

+++++
> tb2[2357, ]
Source: local data frame [1 x 4]

  iso2 year demo n
1   NA 1995 m014 0
> tb2a[2357, ]
Source: local data frame [1 x 4]

  iso2 year demo n
1   NA 1995 m014 0
>
-----

Things certainly *look* the same, but:

+++++
> all.equal(tb2[2357, ], tb2a[2357, ])
[1] "Rows in x but not y: 1Rows in y but not x: 1"
>
-----

If you guessed that it's the NA that's the source of the problem, you're evidently correct:

+++++
> head(which(is.na(tb2[ , "iso2"])))
[1] 2357 2358 2359 2360 2361 2362
>
-----

But I don't understand what the problem is. The "all.equal" function does appear to deal appropriately with NA's. Here's a trivial example:

+++++
> library(pryr)

Attaching package: ‘pryr’

The following object is masked from ‘package:dplyr’:

    %.%

> foo <- c(3, NA, 7)
> bar <- c(3, NA, 7)
> address(foo)  #### note that foo and bar are distinct objects
[1] "0x422c278"
> address(bar)
[1] "0x4953188"
> all.equal(foo, bar)  #### but they're still equal, even with NA
[1] TRUE
>
-----

And just to be sure, I checked that these really are NA's in foo and bar:

+++++
> any(is.na(foo))
[1] TRUE
> any(is.na(bar))
[1] TRUE
>
-----

It finally occurred to me to strip off the extra class attributes and do the comparison:

+++++
> all.equal(data.frame(tb2), data.frame(tb2a))
[1] TRUE
>
-----

So this is evidently a "solution" to the problem, but I don't know what the moral of the story is. If you have any insights, please pass 'em along. Thanks.
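For anyone who wants to poke at the class-stripping step without downloading tb.csv, here's a minimal sketch in base R. The data are made up, and the class vector is set by hand to mimic what `str(tb2)` reports. That is an assumption on my part: it copies only the class attribute, not any of dplyr's methods, so it illustrates the stripping step and does not reproduce the mismatch above.

```r
# Toy data frame with a missing value in a character column
# (hypothetical data, not the real tb.csv).
plain <- data.frame(iso2 = c("AD", NA), year = c(2005L, 1995L),
                    stringsAsFactors = FALSE)

# Mimic the class vector shown by str(tb2) without needing dplyr.
# Assumption: this copies only the class attribute, not dplyr's methods.
tagged <- plain
class(tagged) <- c("tbl_df", "tbl", "data.frame")

# as.data.frame() drops the classes that precede "data.frame".
stripped <- as.data.frame(tagged)
print(class(stripped))

# With the extra classes gone, base R's all.equal compares plain data
# frames, and the NA in iso2 does not cause a mismatch.
result <- all.equal(stripped, plain)
print(result)
```

If the comparison only misbehaves while the extra classes are attached, that would point at the method chosen for those classes rather than at the data, but I haven't confirmed that.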
-- Mike