Chris Anderson wrote:
I have a large dataset that contains duplicate records. How do I identify and
remove duplicate records?
Here's one way:
> aq <- airquality[sample(NROW(airquality), replace=TRUE),]
> any(duplicated(aq))
[1] TRUE
> which(duplicated(aq))
[1] 2 15 34 44 45 47 49 50 52 53 65 75 76 78 83 86 88 90 91
[20] 94 96 98 99 100 103 104 107 108 110 111 112 114 117 119 120 121 122 124
[39] 125 126 127 129 130 132 133 135 137 140 141 143 145 146 147 151 152
> aqs <- subset(aq,!duplicated(aq))
> any(duplicated(aqs))
[1] FALSE
> dim(aqs)
[1] 98 6
> dim(aq)
[1] 153 6
For data frames with many columns, you might want to think more carefully
about how you recognize duplicates, and perhaps compare only a subset of columns.
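
As a rough sketch of the subset-of-columns idea: the choice of Month and Day
as the identifying columns is only an assumption for illustration, as are the
names key, dup, aqk and aql. set.seed() is there only so the resampling can be
repeated. duplicated() is applied to just the key columns, so rows count as
duplicates whenever those columns agree, whatever the other columns hold.

```r
# Resample airquality with replacement so that some rows repeat.
set.seed(1)  # assumption: fixed seed, only for reproducibility
aq <- airquality[sample(NROW(airquality), replace = TRUE), ]

# Hypothetical key: treat rows as duplicates when Month and Day agree.
key <- c("Month", "Day")
dup <- duplicated(aq[, key])   # TRUE for second and later occurrences

# Keep the first row seen for each Month/Day combination.
aqk <- aq[!dup, ]
any(duplicated(aqk[, key]))

# fromLast = TRUE keeps the last occurrence instead of the first.
aql <- aq[!duplicated(aq[, key], fromLast = TRUE), ]
```

Which occurrence to keep matters when the non-key columns can differ between
rows that share a key; sorting the data frame first is one way to control that.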
--
O__ ---- Peter Dalgaard Øster Farimagsgade 5, Entr.B
c/ /'_ --- Dept. of Biostatistics PO Box 2099, 1014 Cph. K
(*) \(*) -- University of Copenhagen Denmark Ph: (+45) 35327918
~~~~~~~~~~ - (p.dalga...@biostat.ku.dk) FAX: (+45) 35327907
______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help