Hi everybody,
I have found something that is (to me at least) strange about duplicated().
I will first give a reproducible example of the behaviour I find odd, and
then a sample of unexpected results from my own data. I hope someone can
help me understand this.
Consider the following:
# this works as expected
ex=sample(1:20, replace=TRUE)
ex
duplicated(ex)
ex=sort(ex)
ex
duplicated(ex)
# but why does duplicated() not work after order()?
ex=sample(1:20, replace=TRUE)
ex
duplicated(ex)
ex=order(ex)
duplicated(ex)
Why does duplicated() not work after order() has been applied, when it works
fine after sort()? Is this an error, or is there something I don't
understand?
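Because sample() gives different values on every run, here is a sketch of the same comparison on a fixed vector. The values are made up for illustration, and the names sorted and ord are only there for this example. The last line checks how the result of order() relates to the result of sort().

```r
# Fixed, made-up vector so that everyone sees the same values
ex <- c(5, 3, 5, 1, 3)

sorted <- sort(ex)    # 1 3 3 5 5
ord    <- order(ex)   # 4 2 5 1 3

duplicated(sorted)    # FALSE FALSE TRUE FALSE TRUE
duplicated(ord)       # FALSE FALSE FALSE FALSE FALSE

# Indexing the original vector by the order() result gives the sort() result
identical(ex[ord], sorted)   # TRUE
```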
I have been getting very strange results from duplicated() and unique() in a
dataset I am analysing. Here is a little sample of my real-life problem:
> str(Masechaba$PROPDESC)
Factor w/ 24545 levels " 06"," 71Hemilton str",..: 14527 8043 16113
16054 13875 15780 12522 7771 14824 12314 ...
> # Create an indicator of whether PROPDESC is unique. Default FALSE
> Masechaba$unique=FALSE
> Masechaba$unique[which(is.na(unique(Masechaba$PROPDESC))==FALSE)]=TRUE
> # Check if something happened
> length(which(Masechaba$unique==TRUE))
[1] 2174
> length(which(Masechaba$unique==FALSE))
[1] 476
> Masechaba$duplicate=FALSE
> Masechaba$duplicate[which(duplicated(Masechaba$PROPDESC)==TRUE)]=TRUE
> length(which(Masechaba$duplicate==TRUE))
[1] 476
> length(which(Masechaba$duplicate==FALSE))
[1] 2174
> # Looks OK so far
> # Test on a known duplicate. I expect one to be true and one to be false
> Masechaba[which(Masechaba$PROPDESC==2363),10:12]
      PROPDESC unique duplicate
24874     2363   TRUE     FALSE
31280     2363   TRUE      TRUE
# This is strange. I expected unique() and duplicated() to give consistent
# results. The variable PROPDESC is clearly not unique in either row.
# The totals are the same, but the individual results are not.
> table(Masechaba$unique,Masechaba$duplicate)
        FALSE TRUE
  FALSE   342  134
  TRUE   1832  342
I don't understand this. Is there something I am missing?
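Since my data cannot be shared easily, here is a small sketch that builds the two indicators in the same way on a made-up factor. The vector x and the names flag_unique and flag_dup are hypothetical stand-ins for PROPDESC and the two indicator columns. It seems to show the same kind of mismatch on a much smaller scale.

```r
# Made-up factor with two repeated values ("a" and "b")
x <- factor(c("a", "b", "a", "c", "b"))

# Same construction as for Masechaba$unique above
flag_unique <- rep(FALSE, length(x))
flag_unique[which(is.na(unique(x)) == FALSE)] <- TRUE

# Same construction as for Masechaba$duplicate above
flag_dup <- rep(FALSE, length(x))
flag_dup[which(duplicated(x) == TRUE)] <- TRUE

length(unique(x))   # 3, shorter than length(x), which is 5
flag_unique         # TRUE TRUE TRUE FALSE FALSE
flag_dup            # FALSE FALSE TRUE FALSE TRUE
table(flag_unique, flag_dup)
```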
Best regards
Christaan
P.S
> sessionInfo()
R version 2.11.1 (2010-05-31)
x86_64-apple-darwin9.8.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] splines stats graphics grDevices utils datasets methods base
other attached packages:
[1] plyr_0.1.9 maptools_0.7-34 lattice_0.18-8 foreign_0.8-40
Hmisc_3.8-0 survival_2.35-8 rgdal_0.6-26
[8] sp_0.9-64
loaded via a namespace (and not attached):
[1] cluster_1.12.3 grid_2.11.1 tools_2.11.1