Sorry Jeff, I did not finish my email. I accidentally touched the send button.

My question was that when I used

length(unique(result2$first))
    vs
dim(result2[!duplicated(result2[,c('first')]),])[1]
I did get different results, but now I have found the problem. Thank you!

On Sun, Feb 12, 2017 at 6:31 PM, Jeff Newmiller <jdnew...@dcn.davis.ca.us> wrote:
> Your question mystifies me, since it looks to me like you already know the
> answer.
> --
> Sent from my phone. Please excuse my brevity.
>
> On February 12, 2017 3:30:49 PM PST, Val <valkr...@gmail.com> wrote:
>> Hi Jeff and all,
>> How do I get the number of unique first names in the two data sets?
>>
>> for the first one,
>> result2 <- DF[ 1 == err2, ]
>> length(unique(result2$first))
>>
>> On Sun, Feb 12, 2017 at 12:42 AM, Jeff Newmiller <jdnew...@dcn.davis.ca.us> wrote:
>>> The "by" function aggregates and returns a result with generally fewer rows
>>> than the original data. Since you are looking to index the rows in the
>>> original data set, the "ave" function is better suited because it always
>>> returns a vector that is just as long as the input vector:
>>>
>>> # I usually work with character data rather than factors if I plan
>>> # to modify the data (e.g. removing rows)
>>> DF <- read.table( text=
>>> 'first week last
>>> Alex   1    West
>>> Bob    1    John
>>> Cory   1    Jack
>>> Cory   2    Jack
>>> Bob    2    John
>>> Bob    3    John
>>> Alex   2    Joseph
>>> Alex   3    West
>>> Alex   4    West
>>> ', header = TRUE, as.is = TRUE )
>>>
>>> err <- ave( DF$last
>>>           , DF[ , "first", drop = FALSE]
>>>           , FUN = function( lst ) {
>>>               length( unique( lst ) )
>>>             }
>>>           )
>>> result <- DF[ "1" == err, ]
>>> result
>>>
>>> Notice that the ave function returns a vector of the same type as was given
>>> to it, so even though the function returns a numeric the err
>>> vector is character.
>>>
>>> If you wanted to be able to examine more than one other column in
>>> determining the keep/reject decision, you could do:
>>>
>>> err2 <- ave( seq_along( DF$first )
>>>            , DF[ , "first", drop = FALSE]
>>>            , FUN = function( n ) {
>>>                length( unique( DF[ n, "last" ] ) )
>>>              }
>>>            )
>>> result2 <- DF[ 1 == err2, ]
>>> result2
>>>
>>> and then you would have the option to re-use the "n" index to look at other
>>> columns as well.
>>>
>>> Finally, here is a dplyr solution:
>>>
>>> library(dplyr)
>>> result3 <- ( DF
>>>            %>% group_by( first ) # like a prep for ave or by
>>>            %>% mutate( err = length( unique( last ) ) ) # similar to ave
>>>            %>% filter( 1 == err ) # drop the rows with too many last names
>>>            %>% select( -err ) # drop the temporary column
>>>            %>% as.data.frame # convert back to a plain-jane data frame
>>>            )
>>> result3
>>>
>>> which uses a small set of verbs in a pipeline of functions to go from input
>>> to result in one pass.
>>>
>>> If your data set is really big (running out of memory big) then you might
>>> want to investigate the data.table or sqlite packages, either of which can
>>> be combined with dplyr to get a standardized syntax for managing larger
>>> amounts of data. However, most people actually aren't running out of memory
>>> so in most cases the extra horsepower isn't actually needed.
>>>
>>>
>>> On Sun, 12 Feb 2017, P Tennant wrote:
>>>
>>>> Hi Val,
>>>>
>>>> The by() function could be used here.
>>>> With the dataframe dfr:
>>>>
>>>> # split the data by first name and check for more than one last name for
>>>> # each first name
>>>> res <- by(dfr, dfr['first'], function(x) length(unique(x$last)) > 1)
>>>> # make the result more easily manipulated
>>>> res <- as.table(res)
>>>> res
>>>> # first
>>>> #  Alex   Bob  Cory
>>>> #  TRUE FALSE FALSE
>>>>
>>>> # then use this result to subset the data
>>>> nw.dfr <- dfr[!dfr$first %in% names(res[res]) , ]
>>>> # sort if needed
>>>> nw.dfr[order(nw.dfr$first) , ]
>>>>
>>>>   first week last
>>>> 2   Bob    1 John
>>>> 5   Bob    2 John
>>>> 6   Bob    3 John
>>>> 3  Cory    1 Jack
>>>> 4  Cory    2 Jack
>>>>
>>>>
>>>> Philip
>>>>
>>>> On 12/02/2017 4:02 PM, Val wrote:
>>>>>
>>>>> Hi all,
>>>>> I have a big data set and want to remove rows conditionally.
>>>>> In my data file each person was recorded for several weeks. Somehow
>>>>> during the recording periods, their last name was misreported. For
>>>>> each person, the last name should be the same. Otherwise remove from
>>>>> the data. For example, in the following data set, Alex was found to have
>>>>> two last names.
>>>>>
>>>>> Alex West
>>>>> Alex Joseph
>>>>>
>>>>> Alex should be removed from the data. If this happens then I want to
>>>>> remove all rows with Alex.
>>>>> Here is my data set
>>>>>
>>>>> df<- read.table(header=TRUE, text='first week last
>>>>> Alex   1    West
>>>>> Bob    1    John
>>>>> Cory   1    Jack
>>>>> Cory   2    Jack
>>>>> Bob    2    John
>>>>> Bob    3    John
>>>>> Alex   2    Joseph
>>>>> Alex   3    West
>>>>> Alex   4    West ')
>>>>>
>>>>> Desired output
>>>>>
>>>>>   first week last
>>>>> 1   Bob    1 John
>>>>> 2   Bob    2 John
>>>>> 3   Bob    3 John
>>>>> 4  Cory    1 Jack
>>>>> 5  Cory    2 Jack
>>>>>
>>>>> Thank you in advance
>>>>>
>>>>> ______________________________________________
>>>>> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>> PLEASE do read the posting guide
>>>>> http://www.R-project.org/posting-guide.html
>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>
>>>
>>> ---------------------------------------------------------------------------
>>> Jeff Newmiller                        The     .....       .....  Go Live...
>>> DCN:<jdnew...@dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
>>>                                       Live:   OO#.. Dead: OO#..  Playing
>>> Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
>>> /Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
>>> ---------------------------------------------------------------------------
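
On the counting question at the top of the thread: length(unique(x)) and the number of rows left after dropping duplicated(x) should normally agree, since both keep one entry per distinct value. The following is only a minimal sketch using the DF and err2 from Jeff's reply (the names n_unique and n_nodup are illustrative and do not appear in the thread). It also checks Jeff's remark that ave() returns a vector of the same type as its first argument.

```r
DF <- read.table(text = 'first week last
Alex 1 West
Bob 1 John
Cory 1 Jack
Cory 2 Jack
Bob 2 John
Bob 3 John
Alex 2 Joseph
Alex 3 West
Alex 4 West
', header = TRUE, as.is = TRUE)

# Number of distinct last names per first name, one value per row of DF
err2 <- ave(seq_along(DF$first), DF$first,
            FUN = function(n) length(unique(DF[n, "last"])))
result2 <- DF[1 == err2, ]

# Two ways to count the distinct first names that remain
n_unique <- length(unique(result2$first))
n_nodup  <- nrow(result2[!duplicated(result2$first), ])

# Passing a character column to ave() yields a character result,
# which is why the first solution compares against "1" rather than 1
err <- ave(DF$last, DF$first, FUN = function(lst) length(unique(lst)))
err_is_character <- is.character(err)
```

With this data both counts refer to Bob and Cory. If the two expressions ever disagree, it is worth checking that both are applied to the same object and the same column.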