So I get my list of IDs to exclude from: g.rr<-do.call(rbind, g.r)[1] dim(g.rr) g.rr[1:(dim(g.rr)[1]/2)]
Many thanks. Stuart -----Original Message----- From: Rui Barradas [mailto:ruipbarra...@sapo.pt] Sent: 23 October 2012 13:42 To: Stuart Leask Cc: r-help@r-project.org Subject: Re: FW: [R] [r] How to pick colums from a ragged array? Hello, You're right, getRepeat returns a list of data.frames, one per each ID. To put them all in the same df use do.call(rbind, g.r) Rui Barradas Em 23-10-2012 13:36, Stuart Leask escreveu: > Sorry, I must be a bit thick.! > getRepeat gives me the data with duplicates - but I don't seem to be able to > manipulate the result. It looks like a list of dataframes: > >> g.r<-getRepeat(id.d) >> dim(g.r) > NULL >> summary(g.r) > Length Class Mode > [1,] 2 data.frame list > [2,] 2 data.frame list > [3,] 2 data.frame list > > This leaves me with the same problem I had with my ragged array i.e. how do I > put all the second elements from this long list of data frames, into a single > list? > > I need to end up with a list of all the IDs that have duplicate first (or > last) DATES. > > Stuart > > -----Original Message----- > From: Leask Stuart > Sent: 23 October 2012 13:17 > To: 'Rui Barradas' > Cc: 'r-help@r-project.org' > Subject: RE: [R] [r] How to pick colums from a ragged array? > > Ah, no, my method does fail. > Consider an ID that has a duplicate DATE that isn't the first date, but it's > first date is the same as another ID's first date that IS a duplicate. > Test data is all - see below it failing. > > So, I remain very grateful for your function! > > Stuart > > > ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323 > ,547,794,814,814,814,814,814,814,841,841,841,841,841 > ,841,841,841,841,910,910,910,910,910,910,999,1019,1019 > ,1019) > > DATE <- > c(20060821,20061207,20080102,20090904,20040205,20040205,20051111 > ,20060111,20071119,20080107,20080407,20080521,20080711,20041005 > ,20070905,20020814,20021125,20040429,20040429,20071205,20080227 > ,20050421,20050421,20060428,20060602,20060816,20061025,20061129 > ,20070112,20070514, 19870508,20040205,20040205, 20091120,20091210 > ,20091224,20050503,19870508,19870508,19880330) > > id.d <- cbind (ID,DATE ) > # rag.a <- split ( id.d [ ,2 ], id.d [ ,1]) # create ragged > array, 1-n DATES for every NAME > # Inelegant attempt to remove IDs that only have one entry: > # rag.s <-tapply (id.d [ ,2], id.d [ ,1], sum) #add up the dates > per row > # Since DATE is in 'year mo da', if there's only one date, sum will be less > than 2100000: > # rag.t <- rag.s [ rag.s > 21000000 ] > # multi.dates <- rownames ( rag.t ) # all the IDs > with >1 date > # rag.am <- rag.a [ multi.dates ] # rag.am only has > IDs with > 1 Date > > how.many <- ave(id.d[,1], id.d[,1], id.d[,2], FUN = length) > nd.b<- id.d[how.many > 1, ] > > #ni<-dim(nd.b)[1] > #nd.IDs<-nd.b[1:(ni/2)*2,1] # list of IDs with dups > #nd.DATEs<-nd.b[1:(ni/2)*2,2] # list of dup'd dates > > earliest<-tapply(DATE,ID,min) # table of mins > rownames(earliest[earliest%in%nd.b]) # IDs of dups with min > # This suggests ID 910 has a duplicate earliest, and it doesn't - it has a > non-earliest duplicate, # and an earliest date that is the same as another > ID's earliest+duplicate. > > > -----Original Message----- > From: Leask Stuart > Sent: 23 October 2012 12:38 > To: 'Rui Barradas' > Cc: r-help@r-project.org > Subject: RE: [R] [r] How to pick colums from a ragged array? > > Thanks Rui - your initial, very elegant suggestion, has spurred me on! > > 1. As you noticed, my example data had no examples of duplicate first dates > (DOH!) I have corrected this, and added a test - an ID that has a duplicate > which is not the earliest DATE, but is the same DATE an earliest/duplicate > for another ID. > > 2. Your suggestion gave me all the duplicates: > > how.many <- ave ( id.d [ ,1], id.d [,1], id.d [,2], FUN = length) > nd.b<- id.d [ how.many > 1, ] > > 3. I can then simply make a table of earliest DATEs by ID, and then see which > DATEs in this table are shared: > > earliest <- tapply ( DATE, ID, min) > rownames(earliest[earliest%in%nd.b]) > > This seems to work - and it does seem exclude IDs which have a duplicate date > which is the same as a minimum date for another ID. > I'm trying to work out why! > > > Many, many thanks for the gift of that function. I will compare the two > approaches (and assume that mine is flawed!). > > > Stuart > > > ************************************************ > > ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323 > ,547,794,814,814,814,814,814,814,841,841,841,841,841 > ,841,841,841,841,910,910,910,910,910,910,999,1019,1019 > ,1019) > > DATE <- > c(20060821,20061207,20080102,20090904,20040205,20040205,20051111 > ,20060111,20071119,20080107,20080407,20080521,20080711,20041005 > ,20070905,20020814,20021125,20040429,20040429,20071205,20080227 > ,20050421,20050421,20060428,20060602,20060816,20061025,20061129 > ,20070112,20070514, 19870409,19870508,19870508, 20091120,20091210 > ,20091224,20050503,19870508,19870508,19880330) > > id.d <- cbind (ID,DATE ) > > how.many <- ave(id.d[,1], id.d[,1], id.d[,2], FUN = length) > nd.b<- id.d[how.many > 1, ] > > earliest <- tapply ( DATE, ID, min) # table of earliest > DATEs > rownames (earliest [earliest %in% nd.b ] ) # IDs of duplicates at the > earliest date for that individual. I think... > > > > > ****************************************************************** > > > > -----Original Message----- > From: Rui Barradas [mailto:ruipbarra...@sapo.pt] > Sent: 23 October 2012 12:21 > To: Stuart Leask > Cc: r-help@r-project.org > Subject: Re: [R] [r] How to pick colums from a ragged array? > > Hello, > > Thinking again, if you just want the first/last in each ID that repeats the > DATE, the following function does the job. Since there were no such cases in > your data example, I've added 3 rows to the dataset. > > ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323 > ,547,794,814,814,814,814,814,814,841,841,841,841,841 > ,841,841,841,841,910,910,910,910,910,910,910,910,999,1019,1019 > ,1019,1019) > > DATE <- > c(20060821,20061207,20080102,20090904,20040205,20040323,20051111 > ,20060111,20071119,20080107,20080407,20080521,20080711,20041005 > ,20070905,20020814,20021125,20040429,20040429,20071205,20080227 > ,20050421,20060130,20060428,20060602,20060816,20061025,20061129 > ,20070112,20070514,20091105,20091105,20091117,20091119,20091120,200912 > 10 > ,20091224,20091224,20050503,19870508,19880223,19880330,19880330) > > id.d <- cbind(ID, DATE) > > > getRepeat <- function(x, first = TRUE){ > fun <- if(first) head else tail > sp <- split(data.frame(x), x[,1]) > first.date <- tapply(x[,2], x[,1], FUN = fun, 1) > lst <- lapply(seq_along(sp), function(j) sp[[j]][,2] == first.date[j]) > n <- unlist(lapply(lst, sum)) > sp1 <- sp[n > 1] > i1 <- lst[n > 1] > lapply(seq_along(sp1), function(j) sp1[[j]][i1[[j]], ]) } > > getRepeat(id.d) # defaults to first = TRUE getRepeat(id.d, first = > FALSE) # to get the last ones > > > Hope this helps, > > Rui Barradas > > > Em 23-10-2012 10:59, Rui Barradas escreveu: >> Hello, >> >> I'm not sure I understand it well, in the solution below the only >> returned value is ID == 814 but it's not the first nor the last DATE. >> >> how.many <- ave(id.d[,1], id.d[,1], id.d[,2], FUN = length) >> id.d[how.many > 1, ] >> >> See the help page for ?ave if the repetition of id.d[,1] is confusing. >> The first is the vector to average (to apply FUN to) and the second >> is one of thw two vectors defining the groups. >> >> Hope this helps, >> >> Rui Barradas >> Em 23-10-2012 10:37, Stuart Leask escreveu: >>> I have a large dataset (~1 million rows) of three variables: ID >>> (patient's name), DATE (of appointment) and DIAGNOSIS (given on that >>> date). >>> Patients may have been assigned more than one diagnosis at any one >>> appointment - leading to two rows, same ID and DATE but different >>> DIAGNOSIS. >>> The diagnoses may change between appointments. >>> >>> I want to subset the data in two ways: >>> >>> - define groups of patients by the first diagnosis given >>> >>> - define groups of patients by the last diagnosis given. >>> >>> The problem: >>> Unfortunately, a small number of patients have been given more than >>> one diagnosis at their first (or last) appointment. These >>> individuals I need to identify and remove, as it's not possible to >>> say uniquely what their first (or last) diagnosis was. So I need to >>> identify and remove these individuals which have pairs of rows with >>> the same ID and (lowest or highest) DATE. The size of the dataset >>> precludes the option of doing this by eye. >>> >>> I suspect there is a very elegant way of doing this in R. >>> >>> This is what I've come up with: >>> >>> >>> - Sort by DATE then ID >>> >>> - Make a ragged array of DATE by ID >>> >>> - Remove IDs that only occur once. >>> >>> - Subtract the first and second DATEs. Remove IDs for which >>> this = zero, as this will only be true for IDs for which the >>> appointment is recorded twice (because there were two diagnoses >>> recorded on this date). >>> >>> - (Then do the same to get the 'last appointment' >>> duplicates, by reversing the initial sort by DATE.) >>> >>> I am stuck at the 'Subtract dates' step: I would like to get the >>> data out of the ragged array by columns (so e.g. I end up with a >>> matrix of ID, 1st DATE, 2nd DATE). But I can't get the dates out by >>> column from the ragged array. >>> >>> I hope someone can help. My ugly code is below, with some data for >>> testing. >>> >>> >>> Stuart >>> >>> >>> Dr Stuart John Leask DM FRCPsych MB BChir MA Clinical Senior >>> Lecturer and Honorary Consultant Pychiatrist Institute of Mental >>> Health, Innovation Park Triumph Road, Nottingham, Notts. NG7 2TU. UK >>> Tel. +44 >>> 115 82 30419 >>> stuart.le...@nottingham.ac.uk<mailto:stuart.le...@nottingham.ac.uk> >>> Google 'Dr Stuart Leask' >>> >>> >>> ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323 >>> ,547,794,814,814,814,814,814,814,841,841,841,841,841 >>> ,841,841,841,841,910,910,910,910,910,910,999,1019,1019 >>> ,1019) >>> >>> DATE <- >>> c(20060821,20061207,20080102,20090904,20040205,20040323,20051111 >>> ,20060111,20071119,20080107,20080407,20080521,20080711,20041005 >>> ,20070905,20020814,20021125,20040429,20040429,20071205,20080227 >>> ,20050421,20060130,20060428,20060602,20060816,20061025,20061129 >>> ,20070112,20070514,20091105,20091117,20091119,20091120,20091210 >>> ,20091224,20050503,19870508,19880223,19880330) >>> >>> id.d <- cbind (ID,DATE ) >>> rag.a <- split ( id.d [ ,2 ], id.d [ ,1]) # create >>> ragged array, 1-n DATES for every NAME >>> >>> # Inelegant attempt to remove IDs that only have one entry: >>> >>> rag.s <-tapply (id.d [ ,2], id.d [ ,1], sum) #add up the dates per >>> row # Since DATE is in 'year mo da', if there's only one date, sum >>> will be less than 2100000: >>> rag.t <- rag.s [ rag.s > 21000000 ] >>> multi.dates <- rownames ( rag.t ) # all the >>> IDs with >1 date >>> rag.am <- rag.a [ multi.dates ] # rag.am >>> only has IDs with > 1 Date >>> >>> >>> # But now I'm stuck. >>> # Each row of the array is rag.am$ID. >>> # So I can't pick columns of DATEs from the ragged array. >>> >>> This message and any attachment are intended solely for the >>> addressee and may contain confidential information. If you have >>> received this message in error, please send it back to me, and immediately >>> delete >>> it. Please do not use, copy or disclose the information contained >>> in this message or in any attachment. Any views or opinions >>> expressed by the author of this email do not necessarily reflect the >>> views of the University of Nottingham. >>> >>> This message has been checked for viruses but the contents of an >>> attachment may still contain software viruses which could damage >>> your computer >>> system: >>> you are advised to perform your own checks. Email communications >>> with the University of Nottingham may be monitored as permitted by >>> UK legislation. >>> [[alternative HTML version deleted]] >>> >>> ______________________________________________ >>> R-help@r-project.org mailing list >>> https://stat.ethz.ch/mailman/listinfo/r-help >>> PLEASE do read the posting guide >>> http://www.R-project.org/posting-guide.html >>> and provide commented, minimal, self-contained, reproducible code. >> ______________________________________________ >> R-help@r-project.org mailing list >> https://stat.ethz.ch/mailman/listinfo/r-help >> PLEASE do read the posting guide >> http://www.R-project.org/posting-guide.html >> and provide commented, minimal, self-contained, reproducible code. ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.