Dear Colleagues,

I am grateful to all of you for helping me with my question: how to write R code that identifies the first row of each ID within a data frame and creates a variable first = 1 for the first row and first = 0 for all repeats of the ID.
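For readers skimming the thread, here is a minimal, self-contained sketch of the task. The data frame and column names follow the example given later in the thread, and the flagging idiom is the duplicated() call acknowledged just below.

ID   <- c(rep(1, 10), rep(2, 6), rep(3, 2))
date <- c(rep(1, 2), rep(2, 2), rep(3, 2), rep(4, 2), rep(5, 2),
          rep(5, 3), rep(6, 3), rep(10, 2))
olddata <- data.frame(ID = ID, date = date)
# first = 1 for the first row of each ID, 0 for all repeats of that ID
olddata$first <- as.numeric(!duplicated(olddata$ID))
head(olddata, 12)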
WOW!!! I just saw Boris Steipe's answer to my question:

olddata$first <- as.numeric(!duplicated(olddata$ID))

The solution is elegant, short, easy to understand, and it uses base R! All important characteristics of a good solution, at least for me. While I want to learn solutions using packages that extend base R, I believe that a good programmer learns how to do something using the base language and, once that is learned, explores ways to solve a programming problem using advanced packages.

Each and every one of you (I hope I did not miss anyone in my list of email addresses) took the time to read my emails and respond to me. Your collective help is invaluable, and I am in your collective debt.

Many, many thanks,
John

John David Sorkin M.D., Ph.D.
Professor of Medicine, University of Maryland School of Medicine;
Associate Director for Biostatistics and Informatics, Baltimore VA Medical Center Geriatrics Research, Education, and Clinical Center;
PI Biostatistics and Informatics Core, University of Maryland School of Medicine Claude D. Pepper Older Americans Independence Center;
Senior Statistician, University of Maryland Center for Vascular Research;
Division of Gerontology and Palliative Care
10 North Greene Street
GRECC (BT/18/GR)
Baltimore, MD 21201-1524
Cell phone 443-418-5382

________________________________________
From: Bert Gunter <bgunter.4...@gmail.com>
Sent: Sunday, December 1, 2024 11:30 AM
To: Rui Barradas
Cc: Sorkin, John; r-help@r-project.org
Subject: Re: [R] Identify first row of each ID within a data frame, create a variable first =1 for the first row and first=0 of all other rows

Rui:

"Of these two, diff is faster. But of all the solutions posted so far, Ben Bolker's is the fastest."

But the explicit version of diff is still considerably faster:

> D <- c(rep(1,10), rep(2,6), rep(3,2))
> microbenchmark(c(1L, diff(D)), times = 1000L)
Unit: microseconds
           expr   min    lq    mean median    uq    max neval
 c(1L, diff(D)) 3.075 3.198 3.34396   3.28 3.362 29.684  1000

> microbenchmark(as.integer(!duplicated(D)), times = 1000L)
Unit: microseconds
                       expr   min    lq     mean median   uq  max neval
 as.integer(!duplicated(D)) 1.476 1.558 1.644264  1.599 1.64 16.4  1000

> microbenchmark(D - c(0L, D[-length(D)]), times = 1000L)
Unit: nanoseconds   ## note that the unit is nanoseconds, not microseconds
                     expr min  lq    mean median  uq  max neval
 D - c(0L, D[-length(D)]) 369 410 489.335    492 533 9840  1000

Cheers,
Bert

On Sat, Nov 30, 2024 at 11:05 PM Rui Barradas <ruipbarra...@sapo.pt> wrote:
>
> At 02:27 on 01/12/2024, Sorkin, John wrote:
> > Dear R help folks,
> >
> > First, my apologies for sending several related questions to the list server. I am trying to learn how to manipulate data in R . . . and am having difficulty getting my program to work. I greatly appreciate the help and support list members give!
> >
> > I am trying to write a program that will run through a data frame organized by ID and, for the first line of each new group of data lines that has the same ID, create a new variable first that will be 1 for the first line of the group and 0 for all other lines.
> >
> > e.g.
> > if my original data is
> >
> > olddata
> > ID date
> >  1    1
> >  1    1
> >  1    2
> >  1    2
> >  1    3
> >  1    3
> >  1    4
> >  1    4
> >  1    5
> >  1    5
> >  2    5
> >  2    5
> >  2    5
> >  2    6
> >  2    6
> >  2    6
> >  3   10
> >  3   10
> >
> > the new data will be
> >
> > newdata
> > ID date first
> >  1    1     1
> >  1    1     0
> >  1    2     0
> >  1    2     0
> >  1    3     0
> >  1    3     0
> >  1    4     0
> >  1    4     0
> >  1    5     0
> >  1    5     0
> >  2    5     1
> >  2    5     0
> >  2    5     0
> >  2    6     0
> >  2    6     0
> >  2    6     0
> >  3   10     1
> >  3   10     0
> >
> > When I run the program below, I receive the following error:
> > Error in df[, "ID"] : incorrect number of dimensions
> >
> > My code:
> >
> > # Create data.frame
> > ID <- c(rep(1,10), rep(2,6), rep(3,2))
> > date <- c(rep(1,2), rep(2,2), rep(3,2), rep(4,2), rep(5,2),
> >           rep(5,3), rep(6,3), rep(10,2))
> > olddata <- data.frame(ID = ID, date = date)
> > class(olddata)
> > cat("This is the original data frame", "\n")
> > print(olddata)
> >
> > # This function is supposed to identify the first row
> > # within each level of ID and, for the first row, set
> > # the variable first to 1, and for all rows other than
> > # the first row set first to 0.
> > mydoit <- function(df){
> >   value <- ifelse(first(df[,"ID"]), 1, 0)
> >   cat("value=", value, "\n")
> >   df[,"first"] <- value
> > }
> > newdata <- aggregate(olddata, list(olddata[,"ID"]), mydoit)
> >
> > Thank you,
> > John
> >
> > John David Sorkin M.D., Ph.D.
> > Professor of Medicine, University of Maryland School of Medicine;
> > Associate Director for Biostatistics and Informatics, Baltimore VA Medical Center Geriatrics Research, Education, and Clinical Center;
> > PI Biostatistics and Informatics Core, University of Maryland School of Medicine Claude D. Pepper Older Americans Independence Center;
> > Senior Statistician, University of Maryland Center for Vascular Research;
> > Division of Gerontology and Palliative Care
> > 10 North Greene Street
> > GRECC (BT/18/GR)
> > Baltimore, MD 21201-1524
> > Cell phone 443-418-5382
>
> Hello,
>
> And here are two other solutions.
>
> olddata$first <- with(olddata, ave(seq_along(ID), ID, FUN = \(x) x == x[1L]))
>
> olddata$first <- c(1L, diff(olddata$ID))
>
> Of these two, diff is faster. But of all the solutions posted so far,
> Ben Bolker's is the fastest. And it can be made a little faster if
> as.integer substitutes for as.numeric.
> And dplyr::mutate now has a .by argument, which avoids the explicit call
> to group_by, with a performance gain.
>
> library(microbenchmark)
> library(dplyr)
>
> mb <- microbenchmark(
>   ave       = with(olddata, ave(seq_along(ID), ID, FUN = \(x) x == x[1L])),
>   dup_num   = as.numeric(!duplicated(olddata$ID)),
>   dup_int   = as.integer(!duplicated(olddata$ID)),
>   diff      = c(1L, diff(olddata$ID)),
>   dplyr_grp = olddata %>% group_by(ID) %>% mutate(first = as.integer(row_number() == 1)),
>   dplyr     = olddata %>% mutate(first = as.integer(row_number() == 1), .by = ID)
> )
> print(mb, order = "median")
>
> However, note that dplyr operates on entire data frames and is therefore
> expected to be slower when tested against instructions that process one
> column only.
>
> Hope this helps,
>
> Rui Barradas
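As a quick sanity check, a short sketch (assuming the olddata frame from the original post; the result names below are only for illustration) confirming that the base-R approaches discussed by Rui and benchmarked by Bert all agree on the example data:

ID   <- c(rep(1, 10), rep(2, 6), rep(3, 2))
date <- c(rep(1, 2), rep(2, 2), rep(3, 2), rep(4, 2), rep(5, 2),
          rep(5, 3), rep(6, 3), rep(10, 2))
olddata <- data.frame(ID = ID, date = date)

first_dup  <- as.integer(!duplicated(olddata$ID))
first_ave  <- with(olddata, ave(seq_along(ID), ID, FUN = \(x) x == x[1L]))  # \(x) needs R >= 4.1
first_diff <- c(1L, diff(olddata$ID))        # gives 0/1 here because ID is sorted and steps by exactly 1
first_expl <- olddata$ID - c(0L, olddata$ID[-nrow(olddata)])  # the "explicit diff"; 0/1 here for the same reason

# All four give the same flags on this data (== is used because diff() returns doubles here)
stopifnot(all(first_dup == first_ave),
          all(first_dup == first_diff),
          all(first_dup == first_expl))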
______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide https://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.