Ah, no, my method does fail. Consider an ID that has a duplicate DATE that isn't the first date, but it's first date is the same as another ID's first date that IS a duplicate. Test data is all - see below it failing.
So, I remain very grateful for your function! Stuart ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323 ,547,794,814,814,814,814,814,814,841,841,841,841,841 ,841,841,841,841,910,910,910,910,910,910,999,1019,1019 ,1019) DATE <- c(20060821,20061207,20080102,20090904,20040205,20040205,20051111 ,20060111,20071119,20080107,20080407,20080521,20080711,20041005 ,20070905,20020814,20021125,20040429,20040429,20071205,20080227 ,20050421,20050421,20060428,20060602,20060816,20061025,20061129 ,20070112,20070514, 19870508,20040205,20040205, 20091120,20091210 ,20091224,20050503,19870508,19870508,19880330) id.d <- cbind (ID,DATE ) # rag.a <- split ( id.d [ ,2 ], id.d [ ,1]) # create ragged array, 1-n DATES for every NAME # Inelegant attempt to remove IDs that only have one entry: # rag.s <-tapply (id.d [ ,2], id.d [ ,1], sum) #add up the dates per row # Since DATE is in 'year mo da', if there's only one date, sum will be less than 2100000: # rag.t <- rag.s [ rag.s > 21000000 ] # multi.dates <- rownames ( rag.t ) # all the IDs with >1 date # rag.am <- rag.a [ multi.dates ] # rag.am only has IDs with > 1 Date how.many <- ave(id.d[,1], id.d[,1], id.d[,2], FUN = length) nd.b<- id.d[how.many > 1, ] #ni<-dim(nd.b)[1] #nd.IDs<-nd.b[1:(ni/2)*2,1] # list of IDs with dups #nd.DATEs<-nd.b[1:(ni/2)*2,2] # list of dup'd dates earliest<-tapply(DATE,ID,min) # table of mins rownames(earliest[earliest%in%nd.b]) # IDs of dups with min # This suggests ID 910 has a duplicate earliest, and it doesn't - it has a non-earliest duplicate, # and an earliest date that is the same as another ID's earliest+duplicate. -----Original Message----- From: Leask Stuart Sent: 23 October 2012 12:38 To: 'Rui Barradas' Cc: r-help@r-project.org Subject: RE: [R] [r] How to pick colums from a ragged array? Thanks Rui - your initial, very elegant suggestion, has spurred me on! 1. As you noticed, my example data had no examples of duplicate first dates (DOH!) I have corrected this, and added a test - an ID that has a duplicate which is not the earliest DATE, but is the same DATE an earliest/duplicate for another ID. 2. Your suggestion gave me all the duplicates: how.many <- ave ( id.d [ ,1], id.d [,1], id.d [,2], FUN = length) nd.b<- id.d [ how.many > 1, ] 3. I can then simply make a table of earliest DATEs by ID, and then see which DATEs in this table are shared: earliest <- tapply ( DATE, ID, min) rownames(earliest[earliest%in%nd.b]) This seems to work - and it does seem exclude IDs which have a duplicate date which is the same as a minimum date for another ID. I'm trying to work out why! Many, many thanks for the gift of that function. I will compare the two approaches (and assume that mine is flawed!). Stuart ************************************************ ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323 ,547,794,814,814,814,814,814,814,841,841,841,841,841 ,841,841,841,841,910,910,910,910,910,910,999,1019,1019 ,1019) DATE <- c(20060821,20061207,20080102,20090904,20040205,20040205,20051111 ,20060111,20071119,20080107,20080407,20080521,20080711,20041005 ,20070905,20020814,20021125,20040429,20040429,20071205,20080227 ,20050421,20050421,20060428,20060602,20060816,20061025,20061129 ,20070112,20070514, 19870409,19870508,19870508, 20091120,20091210 ,20091224,20050503,19870508,19870508,19880330) id.d <- cbind (ID,DATE ) how.many <- ave(id.d[,1], id.d[,1], id.d[,2], FUN = length) nd.b<- id.d[how.many > 1, ] earliest <- tapply ( DATE, ID, min) # table of earliest DATEs rownames (earliest [earliest %in% nd.b ] ) # IDs of duplicates at the earliest date for that individual. I think... ****************************************************************** -----Original Message----- From: Rui Barradas [mailto:ruipbarra...@sapo.pt] Sent: 23 October 2012 12:21 To: Stuart Leask Cc: r-help@r-project.org Subject: Re: [R] [r] How to pick colums from a ragged array? Hello, Thinking again, if you just want the first/last in each ID that repeats the DATE, the following function does the job. Since there were no such cases in your data example, I've added 3 rows to the dataset. ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323 ,547,794,814,814,814,814,814,814,841,841,841,841,841 ,841,841,841,841,910,910,910,910,910,910,910,910,999,1019,1019 ,1019,1019) DATE <- c(20060821,20061207,20080102,20090904,20040205,20040323,20051111 ,20060111,20071119,20080107,20080407,20080521,20080711,20041005 ,20070905,20020814,20021125,20040429,20040429,20071205,20080227 ,20050421,20060130,20060428,20060602,20060816,20061025,20061129 ,20070112,20070514,20091105,20091105,20091117,20091119,20091120,20091210 ,20091224,20091224,20050503,19870508,19880223,19880330,19880330) id.d <- cbind(ID, DATE) getRepeat <- function(x, first = TRUE){ fun <- if(first) head else tail sp <- split(data.frame(x), x[,1]) first.date <- tapply(x[,2], x[,1], FUN = fun, 1) lst <- lapply(seq_along(sp), function(j) sp[[j]][,2] == first.date[j]) n <- unlist(lapply(lst, sum)) sp1 <- sp[n > 1] i1 <- lst[n > 1] lapply(seq_along(sp1), function(j) sp1[[j]][i1[[j]], ]) } getRepeat(id.d) # defaults to first = TRUE getRepeat(id.d, first = FALSE) # to get the last ones Hope this helps, Rui Barradas Em 23-10-2012 10:59, Rui Barradas escreveu: > Hello, > > I'm not sure I understand it well, in the solution below the only > returned value is ID == 814 but it's not the first nor the last DATE. > > how.many <- ave(id.d[,1], id.d[,1], id.d[,2], FUN = length) > id.d[how.many > 1, ] > > See the help page for ?ave if the repetition of id.d[,1] is confusing. > The first is the vector to average (to apply FUN to) and the second is > one of thw two vectors defining the groups. > > Hope this helps, > > Rui Barradas > Em 23-10-2012 10:37, Stuart Leask escreveu: >> I have a large dataset (~1 million rows) of three variables: ID >> (patient's name), DATE (of appointment) and DIAGNOSIS (given on that >> date). >> Patients may have been assigned more than one diagnosis at any one >> appointment - leading to two rows, same ID and DATE but different >> DIAGNOSIS. >> The diagnoses may change between appointments. >> >> I want to subset the data in two ways: >> >> - define groups of patients by the first diagnosis given >> >> - define groups of patients by the last diagnosis given. >> >> The problem: >> Unfortunately, a small number of patients have been given more than >> one diagnosis at their first (or last) appointment. These individuals >> I need to identify and remove, as it's not possible to say uniquely >> what their first (or last) diagnosis was. So I need to identify and >> remove these individuals which have pairs of rows with the same ID >> and (lowest or highest) DATE. The size of the dataset precludes the >> option of doing this by eye. >> >> I suspect there is a very elegant way of doing this in R. >> >> This is what I've come up with: >> >> >> - Sort by DATE then ID >> >> - Make a ragged array of DATE by ID >> >> - Remove IDs that only occur once. >> >> - Subtract the first and second DATEs. Remove IDs for which >> this = zero, as this will only be true for IDs for which the >> appointment is recorded twice (because there were two diagnoses >> recorded on this date). >> >> - (Then do the same to get the 'last appointment' >> duplicates, by reversing the initial sort by DATE.) >> >> I am stuck at the 'Subtract dates' step: I would like to get the data >> out of the ragged array by columns (so e.g. I end up with a matrix of >> ID, 1st DATE, 2nd DATE). But I can't get the dates out by column from >> the ragged array. >> >> I hope someone can help. My ugly code is below, with some data for >> testing. >> >> >> Stuart >> >> >> Dr Stuart John Leask DM FRCPsych MB BChir MA Clinical Senior Lecturer >> and Honorary Consultant Pychiatrist Institute of Mental Health, >> Innovation Park Triumph Road, Nottingham, Notts. NG7 2TU. UK Tel. +44 >> 115 82 30419 >> stuart.le...@nottingham.ac.uk<mailto:stuart.le...@nottingham.ac.uk> >> Google 'Dr Stuart Leask' >> >> >> ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323 >> ,547,794,814,814,814,814,814,814,841,841,841,841,841 >> ,841,841,841,841,910,910,910,910,910,910,999,1019,1019 >> ,1019) >> >> DATE <- >> c(20060821,20061207,20080102,20090904,20040205,20040323,20051111 >> ,20060111,20071119,20080107,20080407,20080521,20080711,20041005 >> ,20070905,20020814,20021125,20040429,20040429,20071205,20080227 >> ,20050421,20060130,20060428,20060602,20060816,20061025,20061129 >> ,20070112,20070514,20091105,20091117,20091119,20091120,20091210 >> ,20091224,20050503,19870508,19880223,19880330) >> >> id.d <- cbind (ID,DATE ) >> rag.a <- split ( id.d [ ,2 ], id.d [ ,1]) # create >> ragged array, 1-n DATES for every NAME >> >> # Inelegant attempt to remove IDs that only have one entry: >> >> rag.s <-tapply (id.d [ ,2], id.d [ ,1], sum) #add up the dates per >> row # Since DATE is in 'year mo da', if there's only one date, sum >> will be less than 2100000: >> rag.t <- rag.s [ rag.s > 21000000 ] >> multi.dates <- rownames ( rag.t ) # all the >> IDs with >1 date >> rag.am <- rag.a [ multi.dates ] # rag.am >> only has IDs with > 1 Date >> >> >> # But now I'm stuck. >> # Each row of the array is rag.am$ID. >> # So I can't pick columns of DATEs from the ragged array. >> >> This message and any attachment are intended solely for the addressee >> and may contain confidential information. If you have received this >> message in error, please send it back to me, and immediately delete >> it. Please do not use, copy or disclose the information contained >> in this message or in any attachment. Any views or opinions >> expressed by the author of this email do not necessarily reflect the >> views of the University of Nottingham. >> >> This message has been checked for viruses but the contents of an >> attachment may still contain software viruses which could damage your >> computer >> system: >> you are advised to perform your own checks. Email communications with >> the University of Nottingham may be monitored as permitted by UK >> legislation. >> [[alternative HTML version deleted]] >> >> ______________________________________________ >> R-help@r-project.org mailing list >> https://stat.ethz.ch/mailman/listinfo/r-help >> PLEASE do read the posting guide >> http://www.R-project.org/posting-guide.html >> and provide commented, minimal, self-contained, reproducible code. > > ______________________________________________ > R-help@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide > http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.