The grouping solutions offered seem to be the obvious way to do this and may even be more efficient in R than what follows below. However, note that they are to some extent doing unnecessary work: the ordering in the data frame already implicitly provides the grouping, so the hashing (or whatever is under the hood of the grouping functions to determine it) is not needed.
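As a quick illustration of that claim, here is a minimal sketch. It assumes the data frame is called "dat", has columns ID and Day as in John's example table, and is already sorted so that each ID occupies one contiguous block of rows. The names is_start and first_day are just illustrative. The point is only that group boundaries can be read off by comparing each ID with the one in the row before it, with no grouping function involved:

```r
# Example data matching John's table (assumption: second column named Day,
# rows already sorted so that each ID forms one contiguous block).
dat <- data.frame(
  ID  = c(rep(1, 10), rep(2, 6), rep(3, 2)),
  Day = c(rep(1:5, each = 2), rep(5, 3), rep(6, 3), rep(10, 2))
)

# A row starts a new group exactly when its ID differs from the previous row's.
is_start <- c(TRUE, dat$ID[-1] != dat$ID[-nrow(dat)])

# cumsum() of the start flags numbers the groups 1, 2, 3, ... in row order,
# so indexing the start-row Days by it spreads each first Day over its group.
first_day <- dat$Day[is_start][cumsum(is_start)]

# Cross-check against the grouping-based answer from ave().
identical(first_day, ave(dat$Day, dat$ID, FUN = function(x) x[1L]))
```

If the rows for an ID were not contiguous, this comparison of neighbours would treat each separate run as its own group, whereas ave() would not, so the sketch really does depend on the ordering.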
So I was wondering how easy it would be to use purely elementary means to take advantage of this and avoid the "unnecessary" work. A more or less obvious approach that occurred to me was to use R's rle() function. I'll first give a prolix, step-by-step explanation for those who may not have used rle(). Then I'll give a concise version of the code.

Assume "dat" is the example data frame of two columns that John gave. Then:

    rle(dat[,1])  ## gives a list with two components:
    Run Length Encoding
      lengths: int [1:3] 10 6 2
      values : int [1:3] 1 2 3

This gives us the grouping for the ID column: 10 1's, followed by 6 2's, followed by 2 3's. Clearly, the row indices for the first row in each group are 1, 11, and 17. We can get these from the "lengths" component of rle() by:

    lens <- rle(dat[,1])$lengths
    ## Then
    cumsum(c(1, lens[-length(lens)]))
    [1]  1 11 17

    ## Therefore, the first days are
    dat[cumsum(c(1, lens[-length(lens)])), 2]
    [1]  1  5 10

    ## So just rep() this with lens to give the FirstDay column:
    rep(dat[cumsum(c(1, lens[-length(lens)])), 2], lens)
     [1]  1  1  1  1  1  1  1  1  1  1  5  5  5  5  5  5 10 10

Here's a concise version of the code:

    lens <- rle(dat$ID)$lengths
    dat <- within(dat,
        FirstDay <- Day[cumsum(c(1, lens[-length(lens)]))] |> rep(lens)
    )

Again, I realize that this sacrifices the clarity of the other solutions that have been given, so I certainly do not claim that it is "better". Nevertheless, I hope it shows another approach that might be interesting and occasionally even useful.

Cheers,
Bert

On Wed, Nov 27, 2024 at 11:38 AM Jeff Newmiller via R-help <r-help@r-project.org> wrote:

> Was wondering when this would be suggested. But the question was about
> getting the final dataframe...
>
> newdata <- olddata
> newdata$FirstDay <- ave(newdata$date, newdata$ID, FUN = \(x) x[1L])
>
> On November 27, 2024 11:13:49 AM PST, Rui Barradas <ruipbarra...@sapo.pt> wrote:
> >At 16:30 on 27/11/2024, Sorkin, John wrote:
> >> I am an old, long time SAS programmer.
> >> I need to produce R code that processes a dataframe in a manner that is equivalent to that produced by using a by statement in SAS and an if first.day statement and a retain statement:
> >>
> >> I want to take data (olddata) that looks like this
> >> ID Day
> >> 1  1
> >> 1  1
> >> 1  2
> >> 1  2
> >> 1  3
> >> 1  3
> >> 1  4
> >> 1  4
> >> 1  5
> >> 1  5
> >> 2  5
> >> 2  5
> >> 2  5
> >> 2  6
> >> 2  6
> >> 2  6
> >> 3  10
> >> 3  10
> >>
> >> and make it look like this
> >> (within each ID I am copying the first value of Day into a new variable, FirstDay, and propagating the FirstDay value through all rows that have the same ID):
> >>
> >> ID Day FirstDay
> >> 1  1   1
> >> 1  1   1
> >> 1  2   1
> >> 1  2   1
> >> 1  3   1
> >> 1  3   1
> >> 1  4   1
> >> 1  4   1
> >> 1  5   1
> >> 1  5   1
> >> 2  5   5
> >> 2  5   5
> >> 2  5   5
> >> 2  6   5
> >> 2  6   5
> >> 2  6   5
> >> 3  10  10
> >> 3  10  10
> >>
> >> SAS code that can do this is:
> >>
> >> proc sort data=olddata;
> >>   by ID Day;
> >> run;
> >>
> >> data newdata;
> >>   retain FirstDay;
> >>   set olddata;
> >>   by ID;
> >>   if first.ID then FirstDay=Day;
> >> run;
> >>
> >> I have NO idea how to do this in R (so I can't post test-code), but below I have R code that creates olddata:
> >>
> >> ID <- c(rep(1,10),rep(2,6),rep(3,2))
> >> date <- c(rep(1,2),rep(2,2),rep(3,2),rep(4,2),rep(5,2),
> >>           rep(5,3),rep(6,3),rep(10,2))
> >> date
> >> olddata <- data.frame(ID=ID,date=date)
> >> olddata
> >>
> >> Any suggestions on how to do this would be appreciated. . . I have worked on this for more than 12 hours, despite multiple web searches I have gotten nowhere. . .
> >>
> >> Thanks
> >> John
> >>
> >> John David Sorkin M.D., Ph.D.
> >> Professor of Medicine, University of Maryland School of Medicine;
> >> Associate Director for Biostatistics and Informatics, Baltimore VA Medical Center Geriatrics Research, Education, and Clinical Center;
> >> PI Biostatistics and Informatics Core, University of Maryland School of Medicine Claude D.
> >> Pepper Older Americans Independence Center;
> >> Senior Statistician University of Maryland Center for Vascular Research;
> >>
> >> Division of Gerontology and Palliative Care,
> >> 10 North Greene Street
> >> GRECC (BT/18/GR)
> >> Baltimore, MD 21201-1524
> >> Cell phone 443-418-5382
> >>
> >> ______________________________________________
> >> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> >> https://stat.ethz.ch/mailman/listinfo/r-help
> >> PLEASE do read the posting guide https://www.R-project.org/posting-guide.html
> >> and provide commented, minimal, self-contained, reproducible code.
> >Hello,
> >
> >Isn't ?ave the simplest way?
> >The first one-liner assumes the dates are sorted in ascending order.
> >
> >ave(olddata$date, olddata$ID, FUN = \(x) x[1L])
> >#> [1] 1 1 1 1 1 1 1 1 1 1 5 5 5 5 5 5 10 10
> >
> >If the dates are not sorted,
> >
> >ave(olddata$date, olddata$ID, FUN = \(x) min(x))
> >
> >Hope this helps,
> >
> >Rui Barradas
>
> --
> Sent from my phone. Please excuse my brevity.