> -----Original Message-----
> From: Bert Gunter [mailto:gunter.ber...@gene.com]
> Sent: Thursday, May 14, 2009 2:31 PM
> To: William Dunlap; 'Gabor Grothendieck'; 'christiaan pauw'; 'jim holtman'
> Cc: r-help@r-project.org
> Subject: RE: [R] Duplicates and duplicated
>
> Thanks, Bill. I also had some concerns about how reliable numeric values
> converted to character might be, so I'm glad to have an authoritative
> criticism. Of course, I was really just being cute with R's versatility.
>
> But Jim Holtman's solution seems like the best way to go, anyway, does it
> not?

That was

   f3 <- function(x) duplicated(x) | duplicated(x, fromLast=TRUE)

which is equivalent to

   function(x) duplicated(x) | rev(duplicated(rev(x)))

in S+, which doesn't have the fromLast= argument. It avoids the problems
involved in table() and ave(), but it just seems sneaky to me.

Linlin Yan's

   f4 <- function(x) x %in% x[duplicated(x)]

seems to me more direct and also avoids those problems.

Mine was wrong. It fails on

   x <- c(1, 2, 8, 2, 4, 5, 10, 1, 4, 16, 2)

My intent was to provide one that would generalize to identifying
all elements that had n or more repetitions in the input vector.
(E.g., you may want to drop from some analysis subjects with fewer
than 5 observations on them.) The corrected version is

   f2<-function(x, n=2){ ix<-match(x,x); tix<-tabulate(ix); ix %in% which(tix>=n) }

E.g.,

   > rbind(x, f2(x), f3(x), f4(x)) # identify duplicated entries
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
   x    1    2    8    2    4    5   10    1    4    16     2
        1    1    0    1    1    0    0    1    1     0     1
        1    1    0    1    1    0    0    1    1     0     1
        1    1    0    1    1    0    0    1    1     0     1
   > rbind(x, f2(x, n=3)) # find ones with >= 3 reps
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
   x    1    2    8    2    4    5   10    1    4    16     2
        0    1    0    1    0    0    0    0    0     0     1

>
> -- Bert
>
> Bert Gunter
> Genentech Nonclinical Biostatistics
>
> -----Original Message-----
> From: William Dunlap [mailto:wdun...@tibco.com]
> Sent: Thursday, May 14, 2009 10:44 AM
> To: Bert Gunter; Gabor Grothendieck; christiaan pauw
> Cc: r-help@r-project.org
> Subject: RE: [R] Duplicates and duplicated
>
> The table()-based solution can have problems when there are
> very closely spaced floating point numbers in x, as in
>    x1<-c(1, 1-.Machine$double.eps, 1+2*.Machine$double.eps)[c(1,2,3,2,3)]
> It also relies on table(x) turning x into a factor with the default
> levels=as.character(sort(x)) and that default may change.
> It omits NA's from the result. (I think it also ought to put the results in
> the original order of the data, so one can, e.g., omit or select values
> which are duplicated.)
>
> The ave()-based solution fails when there are NA's or NaN's in the data.
>    x2 <- c(1,2,3,NA,10,6,3)
>
> The ave()-based solution can be slower than necessary on long datasets,
> especially ones with few or no duplicates.
>    x3 <- sample(1e5,replace=FALSE) ; x3[17] <- x3[length(x3)-17]
>
> I think the following function avoids these problems. It never converts
> the data to character, but uses match() on the original data to convert
> it to a set of unique integers that tabulate can handle.
>
>    f2 <- function(x){
>       ix<-match(x,x)
>       tix<-tabulate(ix)
>       retval<-logical(length(x))
>       retval[which(tix!=1)]<-TRUE
>       retval
>    }
>
> Bill Dunlap
> TIBCO Software Inc - Spotfire Division
> wdunlap tibco.com
>
> > -----Original Message-----
> > From: r-help-boun...@r-project.org
> > [mailto:r-help-boun...@r-project.org] On Behalf Of Bert Gunter
> > Sent: Thursday, May 14, 2009 9:10 AM
> > To: 'Gabor Grothendieck'; 'christiaan pauw'
> > Cc: r-help@r-project.org
> > Subject: Re: [R] Duplicates and duplicated
> >
> > ... or, similar in character to Gabor's solution:
> >
> > tbl <- table(x)
> > (tbl[as.character(sort(x))]>1)+0
> >
> >
> > Bert Gunter
> > Nonclinical Biostatistics
> > 467-7374
> >
> > -----Original Message-----
> > From: r-help-boun...@r-project.org
> > [mailto:r-help-boun...@r-project.org] On
> > Behalf Of Gabor Grothendieck
> > Sent: Thursday, May 14, 2009 7:34 AM
> > To: christiaan pauw
> > Cc: r-help@r-project.org
> > Subject: Re: [R] Duplicates and duplicated
> >
> > Noting that:
> >
> > > ave(x, x, FUN = length) > 1
> >  [1] FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE
> >
> > try this:
> >
> > > rbind(x, dup = ave(x, x, FUN = length) > 1)
> >     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
> > x      1    2    3    4    4    5    6    7    8     9
> > dup    0    0    0    1    1    0    0    0    0     0
> >
> >
> > On Thu, May 14, 2009 at 2:16 AM, christiaan pauw
> > <cjp...@gmail.com> wrote:
> > > Hi everybody.
> > > I want to identify not only the duplicate number but also the original
> > > number that has been duplicated.
> > > Example:
> > > x=c(1,2,3,4,4,5,6,7,8,9)
> > > y=duplicated(x)
> > > rbind(x,y)
> > >
> > > gives:
> > >   [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
> > > x    1    2    3    4    4    5    6    7    8     9
> > > y    0    0    0    0    1    0    0    0    0     0
> > >
> > > i.e. the second 4 [,5] is a duplicate.
> > >
> > > What I want is the first and second 4, i.e. [,4] and [,5], to be TRUE:
> > >
> > >   [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
> > > x    1    2    3    4    4    5    6    7    8     9
> > > y    0    0    0    1    1    0    0    0    0     0
> > >
> > > I assume it can be done by sorting the vector and then checking if the
> > > next or the previous entry matches using identical(). I am just unsure
> > > how to write such a loop, the logic of which (I think) is as follows:
> > >
> > > sort x
> > > for every value of x check if the next value is identical and return TRUE
> > > (or 1) if it is and FALSE (or 0) if it is not
> > > AND
> > > check if the previous value is identical and return TRUE (or 1) if it is
> > > and FALSE (or 0) if it is not
> > >
> > > Am I thinking correctly, and can someone help to write such a function?
> > >
> > > regards
> > > Christiaan
> > >
> > > [[alternative HTML version deleted]]
> > >
> > > ______________________________________________
> > > R-help@r-project.org mailing list
> > > https://stat.ethz.ch/mailman/listinfo/r-help
> > > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > > and provide commented, minimal, self-contained, reproducible code.
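
For anyone who wants to check the functions discussed in this thread side by side, here is a minimal sketch that re-creates f2, f3 and f4 as given in the reply and runs them on the example vector x and on the NA example x2. The name f2_first is made up here purely to label the earlier, uncorrected f2. The comment explaining why it misses the last element (tabulate() returning only max(ix) counts by default) is a reading of the code, not something stated in the thread.

```r
# Functions as given in the thread.
f2 <- function(x, n = 2) {
  ix  <- match(x, x)        # index of the first occurrence of each value
  tix <- tabulate(ix)       # number of elements sharing each first index
  ix %in% which(tix >= n)   # TRUE where the value occurs n or more times
}
f3 <- function(x) duplicated(x) | duplicated(x, fromLast = TRUE)
f4 <- function(x) x %in% x[duplicated(x)]

# Earlier, uncorrected version; f2_first is a hypothetical name used
# only in this sketch.
f2_first <- function(x) {
  ix <- match(x, x)
  tix <- tabulate(ix)
  retval <- logical(length(x))
  retval[which(tix != 1)] <- TRUE
  retval
}

x  <- c(1, 2, 8, 2, 4, 5, 10, 1, 4, 16, 2)
x2 <- c(1, 2, 3, NA, 10, 6, 3)

# The three functions agree on both inputs, including the one with an NA.
stopifnot(identical(f2(x), f3(x)), identical(f3(x), f4(x)))
stopifnot(identical(f2(x2), f3(x2)), identical(f3(x2), f4(x2)))

# tabulate(match(x, x)) has only max(ix) = 10 entries here, so the earlier
# version never touches position 11 and leaves it FALSE.
which(f2(x) != f2_first(x))   # 11

# Generalisation: keep only values seen at least 3 times.
x[f2(x, n = 3)]               # 2 2 2
```

Because each function returns a logical vector in the original order of x, the result can be used directly for subsetting, e.g. x[!f2(x)] to drop every value that occurs more than once.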