Gabor,

My f2 was just wrong.  It should have been

   f2 <- function(x, n=2){
      ix<-match(x,x)
      tix<-tabulate(ix)
      ix %in% which(tix>=n)
   }

which would be roughly the same as your

   f1 <- function(x, n=2) ave(x,x,FUN=length)>=n

and flags all elements of x with >= n repetitions.

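To make the match()/tabulate() steps concrete, here is a small sketch that traces the intermediate values on a toy vector. The vector x and the names ix, tix and flags below are illustrative only and are not taken from the original post.

```r
# Corrected f2 from above: flag every element whose value occurs >= n times.
f2 <- function(x, n = 2) {
  ix  <- match(x, x)        # position of the first occurrence of each value
  tix <- tabulate(ix)       # count of elements sharing each first-occurrence position
  ix %in% which(tix >= n)   # TRUE where the value occurs at least n times
}

x <- c(1, 2, 3, 4, 4, 5, 3)   # illustrative data, not from the thread

ix  <- match(x, x)            # 1 2 3 4 4 6 3
tix <- tabulate(ix)           # 1 1 2 2 0 1  (position 5 is never a first occurrence)
flags <- f2(x)                # FALSE FALSE TRUE TRUE TRUE FALSE TRUE

print(rbind(x, ix, flags))
```

Because match() works on the values themselves, nothing is converted to character or to a factor along the way; the only integers tabulate() sees are positions in x.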
ave() involves a call to factor, which folks on R-devel have been
fiddling with to change how it works with close-together numbers,
so its results may vary with the version of R.  The ix<-match(x,x)
is a way to avoid the dependency on factor.  For very long vectors
with few duplicates tabulate is faster than the many calls to length
in ave, and I think f2 uses less memory because of the lists involved
in the calls to split and lapply in ave.  E.g., on a pretty old Linux
machine:

   > x<-c(1:5e5,5,5,5,7,7,2)
   > which(f2(x))
   [1]      2      5      7 500001 500002 500003 500004 500005 500006
   > which(f1(x))
   [1]      2      5      7 500001 500002 500003 500004 500005 500006
   > system.time(f1(x))
      user  system elapsed
    23.726   0.250  23.999
   > system.time(f2(x))
      user  system elapsed
     0.639   0.003   0.642

ave() is certainly easier to understand.

Bill Dunlap
TIBCO Software Inc - Spotfire Division
wdunlap tibco.com

> -----Original Message-----
> From: Gabor Grothendieck [mailto:ggrothendi...@gmail.com]
> Sent: Thursday, May 14, 2009 2:47 PM
> To: William Dunlap
> Cc: Bert Gunter; christiaan pauw; r-help@r-project.org
> Subject: Re: [R] Duplicates and duplicated
>
> I don't think that that is the conclusion.
>
> All the solutions solve the original problem and the additional
> "requirements" may or may not be what is wanted in any
> particular case.
>
> The ave solution propagates the NA which seems like
> the right thing to do whereas the f2 solution and the
> duplicated solutions label it FALSE which seems
> wrong (though it may be right if that were wanted).
> Also, the f2 solution does not pick up the 3 at the end
> but again that may or may not be wanted.
>
> > x <- c(1, 2, 3, NA, 10, 6, 3)
> > ave(x, x, FUN = length) > 1
> [1] FALSE FALSE  TRUE    NA FALSE FALSE  TRUE
>
> > f2(x)
> [1] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
>
> > duplicated(x) | duplicated(x, fromLast=TRUE)
> [1] FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE
>
> so it all depends on what you want.
>
>
> On Thu, May 14, 2009 at 1:43 PM, William Dunlap
> <wdun...@tibco.com> wrote:
> > The table()-based solution can have problems when there are
> > very closely spaced floating point numbers in x, as in
> >    x1<-c(1, 1-.Machine$double.eps, 1+2*.Machine$double.eps)[c(1,2,3,2,3)]
> > It also relies on table(x) turning x into a factor with the default
> > levels=as.character(sort(x)) and that default may change.
> > It omits NA's from the result.  (I think it also ought to put the
> > results in the original order of the data, so one can, e.g., omit or
> > select values which are duplicated.)
> >
> > The ave()-based solution fails when there are NA's or NaN's in the data.
> >    x2 <- c(1,2,3,NA,10,6,3)
> >
> > The ave()-based solution can be slower than necessary on long datasets,
> > especially ones with few or no duplicates.
> >    x3 <- sample(1e5,replace=FALSE) ; x3[17] <- x3[length(x3)-17]
> >
> > I think the following function avoids these problems.  It never
> > converts the data to character, but uses match() on the original data
> > to convert it to a set of unique integers that tabulate can handle.
> >
> >    f2 <- function(x){
> >       ix<-match(x,x)
> >       tix<-tabulate(ix)
> >       retval<-logical(length(x))
> >       retval[which(tix!=1)]<-TRUE
> >       retval
> >    }
> >
> > Bill Dunlap
> > TIBCO Software Inc - Spotfire Division
> > wdunlap tibco.com
> >
> >> -----Original Message-----
> >> From: r-help-boun...@r-project.org
> >> [mailto:r-help-boun...@r-project.org] On Behalf Of Bert Gunter
> >> Sent: Thursday, May 14, 2009 9:10 AM
> >> To: 'Gabor Grothendieck'; 'christiaan pauw'
> >> Cc: r-help@r-project.org
> >> Subject: Re: [R] Duplicates and duplicated
> >>
> >> ...
> >> or, similar in character to Gabor's solution:
> >>
> >> tbl <- table(x)
> >> (tbl[as.character(sort(x))]>1)+0
> >>
> >>
> >> Bert Gunter
> >> Nonclinical Biostatistics
> >> 467-7374
> >>
> >> -----Original Message-----
> >> From: r-help-boun...@r-project.org
> >> [mailto:r-help-boun...@r-project.org] On Behalf Of Gabor Grothendieck
> >> Sent: Thursday, May 14, 2009 7:34 AM
> >> To: christiaan pauw
> >> Cc: r-help@r-project.org
> >> Subject: Re: [R] Duplicates and duplicated
> >>
> >> Noting that:
> >>
> >> > ave(x, x, FUN = length) > 1
> >>  [1] FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE
> >>
> >> try this:
> >>
> >> > rbind(x, dup = ave(x, x, FUN = length) > 1)
> >>     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
> >> x      1    2    3    4    4    5    6    7    8     9
> >> dup    0    0    0    1    1    0    0    0    0     0
> >>
> >>
> >> On Thu, May 14, 2009 at 2:16 AM, christiaan pauw
> >> <cjp...@gmail.com> wrote:
> >> > Hi everybody.
> >> > I want to identify not only the duplicate number but also the
> >> > original number that has been duplicated.
> >> > Example:
> >> > x=c(1,2,3,4,4,5,6,7,8,9)
> >> > y=duplicated(x)
> >> > rbind(x,y)
> >> >
> >> > gives:
> >> >     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
> >> > x      1    2    3    4    4    5    6    7    8     9
> >> > y      0    0    0    0    1    0    0    0    0     0
> >> >
> >> > i.e. the second 4 [,5] is a duplicate.
> >> >
> >> > What I want is the first and second 4, i.e. [,4] and [,5], to be TRUE
> >> >
> >> >     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
> >> > x      1    2    3    4    4    5    6    7    8     9
> >> > y      0    0    0    1    1    0    0    0    0     0
> >> >
> >> > I assume it can be done by sorting the vector and then checking if
> >> > the next or the previous entry matches using identical().
> >> > I am just unsure how to write such a loop, the logic of
> >> > which (I think) is as follows:
> >> >
> >> > sort x
> >> > for every value of x check if the next value is identical and
> >> > return TRUE (or 1) if it is and FALSE (or 0) if it is not
> >> > AND
> >> > check if the previous value is identical and return TRUE (or 1)
> >> > if it is and FALSE (or 0) if it is not
> >> >
> >> > Am I thinking correctly, and can someone help to write such a function?
> >> >
> >> > regards
> >> > Christiaan
> >> >
> >> > ______________________________________________
> >> > R-help@r-project.org mailing list
> >> > https://stat.ethz.ch/mailman/listinfo/r-help
> >> > PLEASE do read the posting guide
> >> > http://www.R-project.org/posting-guide.html
> >> > and provide commented, minimal, self-contained, reproducible code.
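
As a postscript to the NA discussion in the thread, the sketch below runs the corrected f2 (the n=2 version given at the top of this message) on Gabor's example vector and compares it with the duplicated()-based idiom. The wrapper f2_na is hypothetical and not something proposed in the thread; it only illustrates one way to propagate NA, as the ave()-based answer does, if that behaviour is wanted.

```r
# Corrected f2 from the top of this message.
f2 <- function(x, n = 2) {
  ix  <- match(x, x)
  tix <- tabulate(ix)
  ix %in% which(tix >= n)
}

x <- c(1, 2, 3, NA, 10, 6, 3)   # Gabor's example

# For n = 2 the corrected f2 and the duplicated() idiom agree:
# both flag the two 3s and label the single NA as FALSE.
res_f2  <- f2(x)
res_dup <- duplicated(x) | duplicated(x, fromLast = TRUE)

# Hypothetical wrapper (an assumption, not from the thread):
# return NA wherever the input itself is NA or NaN.
f2_na <- function(x, n = 2) {
  out <- f2(x, n)
  out[is.na(x)] <- NA
  out
}
res_na <- f2_na(x)

print(rbind(res_f2, res_dup, res_na))
```

Which of the two NA treatments is right depends, as Gabor says, on what is wanted; the wrapper just keeps the choice separate from the counting logic.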