Hi Jeff and all,

How do I get the number of unique first names in the two data sets?
for the first one,

result2 <- DF[ 1 == err2, ]
length(unique(result2$first))

On Sun, Feb 12, 2017 at 12:42 AM, Jeff Newmiller <jdnew...@dcn.davis.ca.us> wrote:
> The "by" function aggregates and returns a result with generally fewer rows
> than the original data. Since you are looking to index the rows in the
> original data set, the "ave" function is better suited because it always
> returns a vector that is just as long as the input vector:
>
> # I usually work with character data rather than factors if I plan
> # to modify the data (e.g. removing rows)
> DF <- read.table( text=
> 'first  week last
> Alex    1  West
> Bob     1  John
> Cory    1  Jack
> Cory    2  Jack
> Bob     2  John
> Bob     3  John
> Alex    2  Joseph
> Alex    3  West
> Alex    4  West
> ', header = TRUE, as.is = TRUE )
>
> err <- ave( DF$last
>           , DF[ , "first", drop = FALSE]
>           , FUN = function( lst ) {
>               length( unique( lst ) )
>             }
>           )
> result <- DF[ "1" == err, ]
> result
>
> Notice that the ave function returns a vector of the same type as was given
> to it, so even though the function returns a numeric the err
> vector is character.
>
> If you wanted to be able to examine more than one other column in
> determining the keep/reject decision, you could do:
>
> err2 <- ave( seq_along( DF$first )
>            , DF[ , "first", drop = FALSE]
>            , FUN = function( n ) {
>                length( unique( DF[ n, "last" ] ) )
>              }
>            )
> result2 <- DF[ 1 == err2, ]
> result2
>
> and then you would have the option to re-use the "n" index to look at other
> columns as well.
>
> Finally, here is a dplyr solution:
>
> library(dplyr)
> result3 <- (   DF
>            %>% group_by( first ) # like a prep for ave or by
>            %>% mutate( err = length( unique( last ) ) ) # similar to ave
>            %>% filter( 1 == err ) # drop the rows with too many last names
>            %>% select( -err ) # drop the temporary column
>            %>% as.data.frame # convert back to a plain-jane data frame
>            )
> result3
>
> which uses a small set of verbs in a pipeline of functions to go from input
> to result in one pass.
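
A possible way to count the distinct first names in both groups, building on the err2 vector quoted above. This is only a sketch: it assumes the "two data sets" are the rows that are kept (one last name per first name) and the rows that are rejected (more than one last name). The names keep, kept, rejected, n_kept and n_rejected are made up for this illustration and do not appear elsewhere in the thread.

```r
# Same sample data as in the quoted message
DF <- read.table( text =
'first  week last
Alex    1  West
Bob     1  John
Cory    1  Jack
Cory    2  Jack
Bob     2  John
Bob     3  John
Alex    2  Joseph
Alex    3  West
Alex    4  West
', header = TRUE, as.is = TRUE )

# Number of distinct last names per first name, repeated for every row
err2 <- ave( seq_along( DF$first )
           , DF$first
           , FUN = function( n ) length( unique( DF[ n, "last" ] ) )
           )

keep     <- 1 == err2      # TRUE where the first name has a single last name
kept     <- DF[ keep, ]    # rows retained (same rows as result2)
rejected <- DF[ !keep, ]   # rows dropped because of conflicting last names

# Count the distinct first names in each of the two subsets
n_kept     <- length( unique( kept$first ) )
n_rejected <- length( unique( rejected$first ) )
n_kept      # 2 (Bob and Cory)
n_rejected  # 1 (Alex)
```

Negating the same logical vector gives the second subset directly, so the grouping step does not have to be repeated for the rejected rows.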
>
> If your data set is really big (running out of memory big) then you might
> want to investigate the data.table or sqlite packages, either of which can
> be combined with dplyr to get a standardized syntax for managing larger
> amounts of data. However, most people actually aren't running out of memory
> so in most cases the extra horsepower isn't actually needed.
>
>
> On Sun, 12 Feb 2017, P Tennant wrote:
>
>> Hi Val,
>>
>> The by() function could be used here. With the dataframe dfr:
>>
>> # split the data by first name and check for more than one last name for
>> # each first name
>> res <- by(dfr, dfr['first'], function(x) length(unique(x$last)) > 1)
>> # make the result more easily manipulated
>> res <- as.table(res)
>> res
>> # first
>> #  Alex   Bob  Cory
>> #  TRUE FALSE FALSE
>>
>> # then use this result to subset the data
>> nw.dfr <- dfr[!dfr$first %in% names(res[res]) , ]
>> # sort if needed
>> nw.dfr[order(nw.dfr$first) , ]
>>
>>   first week last
>> 2   Bob    1 John
>> 5   Bob    2 John
>> 6   Bob    3 John
>> 3  Cory    1 Jack
>> 4  Cory    2 Jack
>>
>>
>> Philip
>>
>> On 12/02/2017 4:02 PM, Val wrote:
>>>
>>> Hi all,
>>> I have a big data set and want to remove rows conditionally.
>>> In my data file each person were recorded for several weeks. Somehow
>>> during the recording periods, their last name was misreported. For
>>> each person, the last name should be the same. Otherwise remove from
>>> the data. Example, in the following data set, Alex was found to have
>>> two last names .
>>>
>>> Alex West
>>> Alex Joseph
>>>
>>> Alex should be removed from the data. if this happens then I want
>>> remove all rows with Alex.
>>> Here is my data set
>>>
>>> df<- read.table(header=TRUE, text='first  week last
>>> Alex    1  West
>>> Bob     1  John
>>> Cory    1  Jack
>>> Cory    2  Jack
>>> Bob     2  John
>>> Bob     3  John
>>> Alex    2  Joseph
>>> Alex    3  West
>>> Alex    4  West ')
>>>
>>> Desired output
>>>
>>>   first week last
>>> 1   Bob    1 John
>>> 2   Bob    2 John
>>> 3   Bob    3 John
>>> 4  Cory    1 Jack
>>> 5  Cory    2 Jack
>>>
>>> Thank you in advance
>>>
>>> ______________________________________________
>>> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>
>
> ---------------------------------------------------------------------------
> Jeff Newmiller                        The     .....       .....  Go Live...
> DCN:<jdnew...@dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
>                                       Live:   OO#.. Dead: OO#..  Playing
> Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
> /Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
> ---------------------------------------------------------------------------