Greetings.  I was reading through the vignette for "tidy-data" (from the
"tidyr" package) and came across something that puzzled me.

One of the examples in the vignette uses a data set related to tuberculosis,
originally from the World Health Organization, but also available at:

Here's the code:


> library(dplyr)  #### for tbl_df
> library(tidyr)  #### for gather
> tb <- tbl_df(read.csv("tb.csv", stringsAsFactors=FALSE))

> tb2 <- tb %>%
+     gather(demo, n, -iso2, -year, na.rm=TRUE)

> str(tb2)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 35750 obs. of  4 variables:
 $ iso2: chr  "AD" "AD" "AD" "AE" ...
 $ year: int  2005 2006 2008 2006 2007 2008 2007 2005 2006 2007 ...
 $ demo: Factor w/ 20 levels "m04","m514","m014",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ n   : int  0 0 0 0 0 0 0 0 1 0 ...


I thought it might be interesting to see how to do this using the "reshape2"
package.  Here's the code for that:



> library(reshape2)  #### for melt
> tb2a <- tb %>%
+     melt(id.vars=c("iso2", "year"), variable.name="demo", value.name="n",
+          na.rm=TRUE)
> tb2a <- tbl_df(tb2a)

> str(tb2a)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 35750 obs. of  4 variables:
 $ iso2: chr  "AD" "AD" "AD" "AE" ...
 $ year: int  2005 2006 2008 2006 2007 2008 2007 2005 2006 2007 ...
 $ demo: Factor w/ 20 levels "m04","m514","m014",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ n   : int  0 0 0 0 0 0 0 0 1 0 ...


The "str" results make it appear that I'm on the right track, but it's always
good to double check:


> all.equal(tb2, tb2a)
[1] "Rows in x but not y: 34659, 34658, 34656, 34655, 34651, 34650, 34649,
34648, 34647, 34646, 32264[...]Rows in y but not x: 35663, 34658, 34657,
34656, 34655, 34652, 34651, 34650, 34649, 32265, 32264[...]"


Hmm.  Not what I'd hoped for, and none of the simple visual checks I tried
showed any differences.  After a little trial and error, I found the place where
the results differ:


> ROWS <- 2356
> all.equal(tb2[1:ROWS, ], tb2a[1:ROWS, ])
[1] TRUE
> ROWS <- 2357
> all.equal(tb2[1:ROWS, ], tb2a[1:ROWS, ])
[1] "Rows in x but not y: 2357Rows in y but not x: 2357"


OK, let's have a look at the spot where things go off the rails:


> tb2[2357, ]
Source: local data frame [1 x 4]

  iso2 year demo n
1   NA 1995 m014 0
> tb2a[2357, ]
Source: local data frame [1 x 4]

  iso2 year demo n
1   NA 1995 m014 0


Things certainly *look* the same, but:


> all.equal(tb2[2357, ], tb2a[2357, ])
[1] "Rows in x but not y: 1Rows in y but not x: 1"


If you guessed that it's the NA that's the source of the problem, you're
evidently correct:


> head(which(is.na(tb2[ , "iso2"])))
[1] 2357 2358 2359 2360 2361 2362


But I don't understand what the problem is.  The "all.equal" function does
appear to deal appropriately with NA's.  Here's a trivial example:


> library(pryr)

Attaching package: ‘pryr’

The following object is masked from ‘package:dplyr’:


> foo <- c(3, NA, 7)
> bar <- c(3, NA, 7)
> address(foo)  #### note that foo and bar are distinct objects
[1] "0x422c278"
> address(bar)
[1] "0x4953188"
> all.equal(foo, bar)  #### but they're still equal, even with NA
[1] TRUE


And just to be sure, I checked that these really are NA's in foo and bar:


> any(is.na(foo))
[1] TRUE
> any(is.na(bar))
[1] TRUE


It finally occurred to me to strip off the extra class attributes and do the
comparison again:


> all.equal(data.frame(tb2), data.frame(tb2a))
[1] TRUE


So this is evidently a "solution" to the problem, but I don't know what the
moral of the story is.  If you have any insights, please pass 'em along.
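
For anyone who wants to experiment without the WHO file, here is a small
sketch using made-up data (the values and the "col_check" name are mine, purely
for illustration, and are not taken from tb.csv).  It uses only base R, so it
shows how the base "all.equal" treats an NA in a character column; it does not
reproduce the tbl_df behavior described above.

```r
# Made-up stand-ins for tb2 and tb2a (NOT the WHO data)
x <- data.frame(iso2 = c("AD", NA, "AE"),
                year = c(2005L, 1995L, 2006L),
                n    = c(0L, 0L, 1L),
                stringsAsFactors = FALSE)
y <- x

# Compare each pair of columns with base all.equal;
# isTRUE() turns a "difference" message into FALSE
col_check <- mapply(function(a, b) isTRUE(all.equal(a, b)), x, y)
print(col_check)

# Whole-frame comparison on plain data frames
print(isTRUE(all.equal(x, y)))
```

A per-column check like this may help narrow down which variable is
responsible when a whole-frame comparison complains.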


-- Mike
