Re: [R] help with duplicates

Peter Dalgaard Fri, 05 Jun 2009 11:01:22 -0700

Chris Anderson wrote:

I have a large dataset that contains duplicate records. How do I identify and 
remove duplicate records?


Here's one way:

> aq <- airquality[sample(NROW(airquality), replace=TRUE),]
> any(duplicated(aq))
[1] TRUE
> which(duplicated(aq))
 [1]   2  15  34  44  45  47  49  50  52  53  65  75  76  78  83  86  88  90  91
[20]  94  96  98  99 100 103 104 107 108 110 111 112 114 117 119 120 121 122 124
[39] 125 126 127 129 130 132 133 135 137 140 141 143 145 146 147 151 152
> aqs <- subset(aq,!duplicated(aq))
> any(duplicated(aqs))
[1] FALSE
> dim(aqs)
[1] 98  6
> dim(aq)
[1] 153   6
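
As a hedged aside, the same result can usually be reached with plain logical indexing or with unique(), which for data frames keeps the first occurrence of each row. The sketch below assumes the built-in airquality data and an arbitrary seed; the names aqs_index and aqs_unique are illustrative and not part of the transcript above.

```r
# Resample rows with replacement so that duplicated rows are very likely
set.seed(1)  # arbitrary seed, only so the sketch is reproducible
aq <- airquality[sample(NROW(airquality), replace = TRUE), ]

# Logical indexing: drop every row that repeats an earlier row
aqs_index <- aq[!duplicated(aq), ]

# unique() on a data frame keeps the first occurrence of each row
aqs_unique <- unique(aq)

# Number of rows removed equals the number of rows flagged as duplicates
n_removed <- nrow(aq) - nrow(aqs_unique)
n_flagged <- sum(duplicated(aq))
```

Which form to use is mostly a matter of taste; duplicated() is the more flexible one because the logical vector can be inspected before anything is dropped.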

For data frames with many columns you might want to think more carefully about how you recognize duplicates, and maybe use only a subset of the columns.
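
A minimal sketch of matching on a subset of columns, assuming purely for illustration that Month and Day together identify a record in airquality. The choice of key columns and the names key_cols and dup_key are assumptions, not something stated in the post.

```r
# Resample rows with replacement so that duplicated records are very likely
set.seed(1)  # arbitrary seed, only so the sketch is reproducible
aq <- airquality[sample(NROW(airquality), replace = TRUE), ]

# Assumption: these two columns are enough to identify a record
key_cols <- c("Month", "Day")

# duplicated() on the key columns flags every repeat after the first occurrence
dup_key <- duplicated(aq[, key_cols])

# Keep only the first row seen for each Month/Day combination
aqs <- aq[!dup_key, ]

# No key combination should now appear more than once
any(duplicated(aqs[, key_cols]))
```

Note that when rows agree on the key columns but differ elsewhere, this keeps whichever row comes first, so it is worth sorting or inspecting the flagged rows before dropping them.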


--
   O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark      Ph:  (+45) 35327918
~~~~~~~~~~ - (p.dalga...@biostat.ku.dk)              FAX: (+45) 35327907

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
