Hi Jeff and All,

When I examined the excluded data, i.e., first names with different last names, I noticed that some last names were not recorded. For instance, I modified the data as follows:

DF <- read.table( text=
'first week last
Alex 1 West
Bob 1 John
Cory 1 Jack
Cory 2 -
Bob 2 John
Bob 3 John
Alex 2 Joseph
Alex 3 West
Alex 4 West
', header = TRUE, as.is = TRUE )
err2 <- ave( seq_along( DF$first )
           , DF[ , "first", drop = FALSE]
           , FUN = function( n ) {
               length( unique( DF[ n, "last" ] ) )
             }
           )
result2 <- DF[ 1 == err2, ]
result2

  first week last
2   Bob    1 John
5   Bob    2 John
6   Bob    3 John

However, I want to keep Cory's record. It is assumed that a record with no last name recorded has the same last name as that person's other records. The final output should be:

first week last
Bob 1 John
Bob 2 John
Bob 3 John
Cory 1 Jack
Cory 2 -

Thank you again!

On Sun, Feb 12, 2017 at 7:28 PM, Val <valkr...@gmail.com> wrote:
> Sorry Jeff, I did not finish my email. I accidentally touched the send
> button.
> My question was: when I used this one
> length(unique(result2$first))
> vs
> dim(result2[!duplicated(result2[,c('first')]),])[1]
>
> I did get different results but now I found out the problem.
>
> Thank you!
>
> On Sun, Feb 12, 2017 at 6:31 PM, Jeff Newmiller
> <jdnew...@dcn.davis.ca.us> wrote:
>> Your question mystifies me, since it looks to me like you already know the
>> answer.
>> --
>> Sent from my phone. Please excuse my brevity.
>>
>> On February 12, 2017 3:30:49 PM PST, Val <valkr...@gmail.com> wrote:
>>>Hi Jeff and all,
>>> How do I get the number of unique first names in the two data sets?
>>>
>>>for the first one,
>>>result2 <- DF[ 1 == err2, ]
>>>length(unique(result2$first))
>>>
>>>On Sun, Feb 12, 2017 at 12:42 AM, Jeff Newmiller
>>><jdnew...@dcn.davis.ca.us> wrote:
>>>> The "by" function aggregates and returns a result with generally fewer rows
>>>> than the original data. Since you are looking to index the rows in the
>>>> original data set, the "ave" function is better suited because it always
>>>> returns a vector that is just as long as the input vector:
>>>>
>>>> # I usually work with character data rather than factors if I plan
>>>> # to modify the data (e.g. removing rows)
>>>> DF <- read.table( text=
>>>> 'first week last
>>>> Alex 1 West
>>>> Bob 1 John
>>>> Cory 1 Jack
>>>> Cory 2 Jack
>>>> Bob 2 John
>>>> Bob 3 John
>>>> Alex 2 Joseph
>>>> Alex 3 West
>>>> Alex 4 West
>>>> ', header = TRUE, as.is = TRUE )
>>>>
>>>> err <- ave( DF$last
>>>>           , DF[ , "first", drop = FALSE]
>>>>           , FUN = function( lst ) {
>>>>               length( unique( lst ) )
>>>>             }
>>>>           )
>>>> result <- DF[ "1" == err, ]
>>>> result
>>>>
>>>> Notice that the ave function returns a vector of the same type as was given
>>>> to it, so even though the function returns a numeric the err
>>>> vector is character.
>>>>
>>>> If you wanted to be able to examine more than one other column in
>>>> determining the keep/reject decision, you could do:
>>>>
>>>> err2 <- ave( seq_along( DF$first )
>>>>            , DF[ , "first", drop = FALSE]
>>>>            , FUN = function( n ) {
>>>>                length( unique( DF[ n, "last" ] ) )
>>>>              }
>>>>            )
>>>> result2 <- DF[ 1 == err2, ]
>>>> result2
>>>>
>>>> and then you would have the option to re-use the "n" index to look at other
>>>> columns as well.
>>>>
>>>> Finally, here is a dplyr solution:
>>>>
>>>> library(dplyr)
>>>> result3 <- ( DF
>>>>            %>% group_by( first ) # like a prep for ave or by
>>>>            %>% mutate( err = length( unique( last ) ) ) # similar to ave
>>>>            %>% filter( 1 == err ) # drop the rows with too many last names
>>>>            %>% select( -err ) # drop the temporary column
>>>>            %>% as.data.frame # convert back to a plain-jane data frame
>>>>            )
>>>> result3
>>>>
>>>> which uses a small set of verbs in a pipeline of functions to go from input
>>>> to result in one pass.
>>>>
>>>> If your data set is really big (running out of memory big) then you might
>>>> want to investigate the data.table or sqlite packages, either of which can
>>>> be combined with dplyr to get a standardized syntax for managing larger
>>>> amounts of data.
>>>> However, most people actually aren't running out of memory
>>>> so in most cases the extra horsepower isn't actually needed.
>>>>
>>>>
>>>> On Sun, 12 Feb 2017, P Tennant wrote:
>>>>
>>>>> Hi Val,
>>>>>
>>>>> The by() function could be used here. With the dataframe dfr:
>>>>>
>>>>> # split the data by first name and check for more than one last name for
>>>>> # each first name
>>>>> res <- by(dfr, dfr['first'], function(x) length(unique(x$last)) > 1)
>>>>> # make the result more easily manipulated
>>>>> res <- as.table(res)
>>>>> res
>>>>> # first
>>>>> #  Alex   Bob  Cory
>>>>> #  TRUE FALSE FALSE
>>>>>
>>>>> # then use this result to subset the data
>>>>> nw.dfr <- dfr[!dfr$first %in% names(res[res]) , ]
>>>>> # sort if needed
>>>>> nw.dfr[order(nw.dfr$first) , ]
>>>>>
>>>>>   first week last
>>>>> 2   Bob    1 John
>>>>> 5   Bob    2 John
>>>>> 6   Bob    3 John
>>>>> 3  Cory    1 Jack
>>>>> 4  Cory    2 Jack
>>>>>
>>>>>
>>>>> Philip
>>>>>
>>>>> On 12/02/2017 4:02 PM, Val wrote:
>>>>>>
>>>>>> Hi all,
>>>>>> I have a big data set and want to remove rows conditionally.
>>>>>> In my data file each person was recorded for several weeks. Somehow
>>>>>> during the recording periods, their last name was misreported. For
>>>>>> each person, the last name should be the same. Otherwise remove from
>>>>>> the data. Example, in the following data set, Alex was found to have
>>>>>> two last names.
>>>>>>
>>>>>> Alex West
>>>>>> Alex Joseph
>>>>>>
>>>>>> Alex should be removed from the data. If this happens then I want to
>>>>>> remove all rows with Alex. Here is my data set
>>>>>>
>>>>>> df<- read.table(header=TRUE, text='first week last
>>>>>> Alex 1 West
>>>>>> Bob 1 John
>>>>>> Cory 1 Jack
>>>>>> Cory 2 Jack
>>>>>> Bob 2 John
>>>>>> Bob 3 John
>>>>>> Alex 2 Joseph
>>>>>> Alex 3 West
>>>>>> Alex 4 West ')
>>>>>>
>>>>>> Desired output
>>>>>>
>>>>>>   first week last
>>>>>> 1   Bob    1 John
>>>>>> 2   Bob    2 John
>>>>>> 3   Bob    3 John
>>>>>> 4  Cory    1 Jack
>>>>>> 5  Cory    2 Jack
>>>>>>
>>>>>> Thank you in advance
>>>>>>
>>>>>> ______________________________________________
>>>>>> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>>> PLEASE do read the posting guide
>>>>>> http://www.R-project.org/posting-guide.html
>>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>
>>>> ---------------------------------------------------------------------------
>>>> Jeff Newmiller                        The     .....       .....  Go Live...
>>>> DCN:<jdnew...@dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
>>>>                                       Live:   OO#.. Dead: OO#..  Playing
>>>> Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
>>>> /Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
>>>> ---------------------------------------------------------------------------
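
The request at the top of this thread (keep Cory even though one of his last names was entered as "-") is not covered by the replies quoted above. What follows is only a minimal sketch of one way to adapt the ave() approach, not a definitive answer. It assumes that "-" is the only placeholder used for a last name that was not recorded, and the names known, n_last and result4 are made up for this illustration:

```r
# Val's modified data; "-" marks a last name that was not recorded
DF <- read.table( text =
'first week last
Alex 1 West
Bob 1 John
Cory 1 Jack
Cory 2 -
Bob 2 John
Bob 3 John
Alex 2 Joseph
Alex 3 West
Alex 4 West
', header = TRUE, as.is = TRUE )

# Assumption: "-" is the only marker for a missing last name
known <- DF$last != "-"

# For every row, count the distinct *recorded* last names
# among all rows that share its first name
n_last <- ave( seq_along( DF$first )
             , DF$first
             , FUN = function( n ) {
                 length( unique( DF$last[ n ][ known[ n ] ] ) )
               }
             )

# Keep the people with at most one recorded last name, then sort
result4 <- DF[ n_last <= 1, ]
result4 <- result4[ order( result4$first, result4$week ), ]
result4
```

With this data, Alex is dropped because he has two recorded last names (West and Joseph), while Bob and Cory are kept, including the Cory row with "-". Note that a person whose last name was never recorded at all would have a count of zero and would also be kept by the `<= 1` test; use `== 1` instead if such people should be removed.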