HI Stuart, Just a doubt: When I run Rui's function: I got results as: g.rr[rep(!duplicated(g.rr)[(1:(dim(g.rr)[1]/2))*2],each=2),] # ID DATE DG #22 841 20050421 1 #23 841 20050421 2 #38 1019 19870508 2 #39 1019 19870508 1
But, when I look at the id.d 16 814 20020814 2 17 814 20021125 2 18 814 20040429 2 19 814 20040429 2 20 814 20071205 2 21 814 20071205 2 ------------------------- 323 20051111 3 8 323 20060111 2 9 323 20071119 3 10 323 20080107 2 11 323 20080407 1 12 323 20080521 2 13 323 20080521 3 According to the function I wrote earlier, I get res1 flag 58 FALSE 167 TRUE 323 TRUE 547 FALSE 794 FALSE 814 TRUE 841 TRUE 910 FALSE 999 FALSE 1019 TRUE If I understand correctly, you wanted to eliminate 814 and 167 because the DGs are the same. That can be done. I also want to make it clear that why 323 is not included. It is duplicated in the last visit. Now, to eliminate the ones with duplicate DGs: res1<- data.frame(flag=tapply(id.d[,2],id.d[,1],FUN=function(x) head(duplicated(x)|duplicated(x,fromLast=TRUE),1)|tail(duplicated(x)|duplicated(x,fromLast=TRUE),1))) res2<-id.d[id.d[,1]%in%names(res1[res1$flag==TRUE,])&(duplicated(id.d[,1:2])|duplicated(id.d[,1:2],fromLast=TRUE)),] res2[!res2$ID%in% res2[duplicated(res2)|duplicated(res2,fromLast=TRUE),]$ID,] # ID DATE DG #12 323 20080521 2 #13 323 20080521 3 #22 841 20050421 1 #23 841 20050421 2 #38 1019 19870508 2 #39 1019 19870508 1 A.K. ----- Original Message ----- From: Stuart Leask <stuart.le...@nottingham.ac.uk> To: "arun (smartpink...@yahoo.com)" <smartpink...@yahoo.com>; PIKAL Petr <petr.pi...@precheza.cz>; "Rui Barradas (ruipbarra...@sapo.pt)" <ruipbarra...@sapo.pt> Cc: Sent: Wednesday, October 24, 2012 11:41 AM Subject: RE: [r] How to pick colums from a ragged array? (And, considering the real application, the functions ideally should probably output a variable INCLUDE, the same length as the original data, with TRUE and FALSE for whether or not that row should be included...) -----Original Message----- From: Leask Stuart Sent: 24 October 2012 16:25 To: arun (smartpink...@yahoo.com); 'PIKAL Petr'; Rui Barradas (ruipbarra...@sapo.pt) Subject: RE: [r] How to pick colums from a ragged array? Arun, Petr, Rui, many thanks for your help, and the functions you have written. You'll recall I wanted to remove these first (or last) duplicates, because they represented instances where two different diagnoses (in this case, variable DG, value 1, 2, 3, 4 or 5) had been recorded on the same day - so I can't say which was 'first' (or 'last'). Your functions have revealed something I wasn't expecting: In some cases, the diagnoses recorded on the duplicated DATEs are the same! This is a surprise to me, but probably reflects someone going to two different departments in a clinic, and both departments submit data. I have to say that crazy things like this are often a feature of real data, which I'm sure you've come across yourselves. Of course, I don't want to remove records in which I can determine an unambiguous 'first diagnosis'. You have all put in so much effort on my behalf, I'm ashamed to ask, but I wonder if any of the functions you've written could do this with a little more Indexing and the 'duplicate' function So the function should only exclude an ID, having identified a first (or last) DATE duplicate, the DGs for these two dates are different. Test dataset: ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323 ,547,794,814,814,814,814,814,814,841,841,841,841,841 ,841,841,841,841,910,910,910,910,910,910,999,1019,1019 ,1019) DATE <- c(20060821,20061207,20080102,20090904,20040205,20040205,20051111 ,20060111,20071119,20080107,20080407,20080521,20080521,20041005 ,20070905,20020814,20021125,20040429,20040429,20071205,20071205 ,20050421,20050421,20060428,20060602,20060816,20061025,20061129 ,20070112,20070514, 19870508,20040205,20040205, 20080521,20080521 ,20091224,20050503,19870508,19870508,19880330) DG<- c(1,2,1,1,4,4,3,2,3,2,1,2,3,2,1,2,2,2,2,2,2,1,2,1,1,1,1,1,1,4,3,3,3,4,3,2,2,2,1,1) id.d<-data.frame(ID,DATE,DG) id.d # Considering Ruis getRepeat function: g.r<-getRepeat(id.d) # defaults to first = TRUE getRepeat(id.d, first = FALSE) to get the last ones g.rr<-do.call(rbind, g.r) # put the data into a matrix # I can remove the date duplicates with: g.rr[rep(!duplicated(g.rr)[(1:(dim(g.rr)[1]/2))*2],each=2),] I'm not sure how to add this to your suggestions, Arun & Petr... Stuart -----Original Message----- From: PIKAL Petr [mailto:petr.pi...@precheza.cz] Sent: 23 October 2012 15:24 To: Stuart Leask Subject: RE: [r] How to pick colums from a ragged array? Hi I assumed that id.d is data frame id.d <- data.frame (ID,DATE ) and fff(id.d) works for me Petr > -----Original Message----- > From: Stuart Leask [mailto:stuart.le...@nottingham.ac.uk] > Sent: Tuesday, October 23, 2012 3:13 PM > To: PIKAL Petr > Subject: RE: [r] How to pick colums from a ragged array? > > Hi Petr. > I see what you mean it should do, but when I run it I get an error > (see below). > Stuart > > > > ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323 > + ,547,794,814,814,814,814,814,814,841,841,841,841,841 > + ,841,841,841,841,910,910,910,910,910,910,999,1019,1019 > + ,1019) > > > > DATE <- > + c(20060821,20061207,20080102,20090904,20040205,20040205,20051111 > + ,20060111,20071119,20080107,20080407,20080521,20080711,20041005 > + ,20070905,20020814,20021125,20040429,20040429,20071205,20080227 > + ,20050421,20050421,20060428,20060602,20060816,20061025,20061129 > + ,20070112,20070514, 19870508,20040205,20040205, 20091120,20091210 > + ,20091224,20050503,19870508,19870508,19880330) > > > > id.d <- cbind (ID,DATE ) > > fff<-function(data, first=TRUE, remove=FALSE) { > + > + testfirst <- function(x) x[1,2]==x[2,2] testlast <- function(x) > + x[nrow(x),2]==x[nrow(x)-1,2] > + > + if(first) sel <- as.numeric(names(which(unlist(sapply(split(data, > + data[,1]), testfirst))))) else sel <- > + as.numeric(names(which(unlist(sapply(split(data, data[,1]), > + testlast))))) > + > + if (remove) data[!data[,1] %in% sel,] else data[data[,1] %in% sel,] > + } > > > > fff(id.d) > Error in x[1, 2] : incorrect number of dimensions > > > > -----Original Message----- > From: PIKAL Petr [mailto:petr.pi...@precheza.cz] > Sent: 23 October 2012 13:51 > To: Stuart Leask; r-help@r-project.org > Subject: RE: [r] How to pick colums from a ragged array? > > Hi > > > -----Original Message----- > > From: Stuart Leask [mailto:stuart.le...@nottingham.ac.uk] > > Sent: Tuesday, October 23, 2012 2:29 PM > > To: PIKAL Petr; r-help@r-project.org > > Subject: RE: [r] How to pick colums from a ragged array? > > > > Hi there. > > > > Not sure I follow what you are doing. > > > > I want a list of all the IDs that have duplicate DATE entries, only > > when the DATE is the earliest (or last) date for that ID. > > And that is what the function (with 3 small modifications) does > > > fff<-function(data, first=TRUE, remove=FALSE) { > > testfirst <- function(x) x[1,2]==x[2,2] testlast <- function(x) > x[nrow(x),2]==x[nrow(x)-1,2] > > if(first) sel <- as.numeric(names(which(unlist(sapply(split(data, > data[,1]), testfirst))))) else sel <- > as.numeric(names(which(unlist(sapply(split(data, data[,1]), > testlast))))) > > if (remove) data[!data[,1] %in% sel,] else data[data[,1] %in% sel,] } > > See the result of your refined data > > fff(id.d) > ID DATE > 5 167 2004-02-05 > 6 167 2004-02-05 > 22 841 2005-04-21 > 23 841 2005-04-21 > 24 841 2006-04-28 > 25 841 2006-06-02 > 26 841 2006-08-16 > 27 841 2006-10-25 > 28 841 2006-11-29 > 29 841 2007-01-12 > 30 841 2007-05-14 > 38 1019 1987-05-08 > 39 1019 1987-05-08 > 40 1019 1988-03-30 > > fff(id.d, first=F) > ID DATE > 5 167 2004-02-05 > 6 167 2004-02-05 > > fff(id.d, remove=T) > ID DATE > 1 58 2006-08-21 > 2 58 2006-12-07 > 3 58 2008-01-02 > 4 58 2009-09-04 > 7 323 2005-11-11 > 8 323 2006-01-11 > 9 323 2007-11-19 > 10 323 2008-01-07 > 11 323 2008-04-07 > 12 323 2008-05-21 > 13 323 2008-07-11 > 14 547 2004-10-05 > 15 794 2007-09-05 > 16 814 2002-08-14 > 17 814 2002-11-25 > 18 814 2004-04-29 > 19 814 2004-04-29 > 20 814 2007-12-05 > 21 814 2008-02-27 > 31 910 1987-05-08 > 32 910 2004-02-05 > 33 910 2004-02-05 > 34 910 2009-11-20 > 35 910 2009-12-10 > 36 910 2009-12-24 > 37 999 2005-05-03 > > > > You can do surgery on fff function to see what result comes from some > piece of the function e.g. > > sapply(split(id.d, id.d[,1]), testlast) > > Regards > Petr > > > > > I have refined my test dataset, to include some tests (e.g. 910 has > > the same dup as 1019, but for 910 it's not the earliest date): > > > > > > ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323 > > ,547,794,814,814,814,814,814,814,841,841,841,841,841 > > ,841,841,841,841,910,910,910,910,910,910,999,1019,1019 > > ,1019) > > > > DATE <- > > c(20060821,20061207,20080102,20090904,20040205,20040205,20051111 > > ,20060111,20071119,20080107,20080407,20080521,20080711,20041005 > > ,20070905,20020814,20021125,20040429,20040429,20071205,20080227 > > ,20050421,20050421,20060428,20060602,20060816,20061025,20061129 > > ,20070112,20070514, 19870508,20040205,20040205, 20091120,20091210 > > ,20091224,20050503,19870508,19870508,19880330) > > > > Correct output: > > "167" "841" "1019" > > > > Stuart > > > > -----Original Message----- > > From: PIKAL Petr [mailto:petr.pi...@precheza.cz] > > Sent: 23 October 2012 13:15 > > To: Stuart Leask; r-help@r-project.org > > Subject: RE: [r] How to pick colums from a ragged array? > > > > Hi > > > > Rui's answer brought me to more elaborated solution which still > > needs data frame to be ordered by date > > > > fff<-function(data, first=TRUE, remove=FALSE) { > > > > testfirst <- function(x) x[1,2]==x[2,2] testlast <- function(x) > > x[length(x),2]==x[length(x)-1,2] > > > > if(first) sel <- as.numeric(names(which(sapply(split(data, > > data[,1]), > > testfirst)))) else sel <- as.numeric(names(which(sapply(split(data, > > data[,1]), testlast)))) > > > > if (remove) data[data[,1]!=sel,] else data[data[,1]==sel,] } > > > > > > > fff(id.d) > > ID DATE > > 31 910 20091105 > > 32 910 20091105 > > 33 910 20091117 > > 34 910 20091119 > > 35 910 20091120 > > 36 910 20091210 > > 37 910 20091224 > > 38 910 20091224 > > > > > fff(id.d, remove=T) > > ID DATE > > 1 58 20060821 > > 2 58 20061207 > > 3 58 20080102 > > 4 58 20090904 > > 5 167 20040205 > > 6 167 20040323 > > 7 323 20051111 > > 8 323 20060111 > > 9 323 20071119 > > 10 323 20080107 > > 11 323 20080407 > > 12 323 20080521 > > 13 323 20080711 > > 14 547 20041005 > > 15 794 20070905 > > 16 814 20020814 > > 17 814 20021125 > > 18 814 20040429 > > 19 814 20040429 > > 20 814 20071205 > > 21 814 20080227 > > 22 841 20050421 > > 23 841 20060130 > > 24 841 20060428 > > 25 841 20060602 > > 26 841 20060816 > > 27 841 20061025 > > 28 841 20061129 > > 29 841 20070112 > > 30 841 20070514 > > 39 999 20050503 > > 40 1019 19870508 > > 41 1019 19880223 > > 42 1019 19880330 > > 43 1019 19880330 > > > > > > > Regards > > Petr > > > > > > > -----Original Message----- > > > From: r-help-boun...@r-project.org [mailto:r-help-bounces@r- > > > project.org] On Behalf Of PIKAL Petr > > > Sent: Tuesday, October 23, 2012 1:49 PM > > > To: Stuart Leask; r-help@r-project.org > > > Subject: Re: [R] [r] How to pick colums from a ragged array? > > > > > > Hi > > > > > > I did not check your code and rather followed your explanation. > BTW, > > > thanks for test data. > > > > > > small change in data frame to make DATE as Date class > > > > > > datum<-as.Date(as.character(DATE), format="%Y%m%d") id.d <- > > > data.frame(ID,datum ) > > > > > > ordering by date > > > > > > id.d<-id.d[order(id.d$datum),] > > > > > > > > > two functions to test if first two dates are the same or last two > > > dates are the same > > > > > > testfirst <- function(x) x[1,2]==x[2,2] testlast <- function(x) > > > x[length(x),2]==x[length(x)-1,2] > > > > > > change one last date in the data frame to be the same as previous > > > > > > id.d[35,2]<-id.d[36,2] > > > > > > and here are results > > > > > > sapply(split(id.d, id.d$ID), testlast) > > > 58 167 323 547 794 814 841 910 999 1019 > > > FALSE FALSE FALSE NA NA FALSE FALSE TRUE NA FALSE > > > > > > > sapply(split(id.d, id.d$ID), testfirst) > > > 58 167 323 547 794 814 841 910 999 1019 > > > FALSE FALSE FALSE NA NA FALSE FALSE FALSE NA FALSE > > > > > > Now you can select ID which is true and remove it from your data > > > which(sapply(split(id.d, id.d$ID), testlast)) > > > > > > and use it for your data frame to subset/remove id.d$ID == > > > as.numeric(names(which(sapply(split(id.d, id.d$ID), testlast)))) > [1] > > > FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE > > > FALSE [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE > > FALSE > > > FALSE FALSE [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE > > FALSE > > > FALSE TRUE TRUE [37] TRUE TRUE TRUE TRUE > > > > > > However I am not sure if this is exactly what you want. > > > > > > Regards > > > Petr > > > > > > > -----Original Message----- > > > > From: r-help-boun...@r-project.org [mailto:r-help-bounces@r- > > > > project.org] On Behalf Of Stuart Leask > > > > Sent: Tuesday, October 23, 2012 11:38 AM > > > > To: r-help@r-project.org > > > > Subject: [R] [r] How to pick colums from a ragged array? > > > > > > > > I have a large dataset (~1 million rows) of three variables: ID > > > > (patient's name), DATE (of appointment) and DIAGNOSIS (given on > > that > > > > date). > > > > Patients may have been assigned more than one diagnosis at any > one > > > > appointment - leading to two rows, same ID and DATE but > > > > different DIAGNOSIS. > > > > The diagnoses may change between appointments. > > > > > > > > I want to subset the data in two ways: > > > > > > > > - define groups of patients by the first diagnosis given > > > > > > > > - define groups of patients by the last diagnosis given. > > > > > > > > The problem: > > > > Unfortunately, a small number of patients have been given more > > > > than one diagnosis at their first (or last) appointment. These > > > > individuals I need to identify and remove, as it's not possible > to > > > > say uniquely what their first (or last) diagnosis was. So I need > > > > to identify and remove these individuals which have pairs of > > > > rows with the same ID > > > and > > > > (lowest or highest) DATE. The size of the dataset precludes the > > > option > > > > of doing this by eye. > > > > > > > > I suspect there is a very elegant way of doing this in R. > > > > > > > > This is what I've come up with: > > > > > > > > > > > > - Sort by DATE then ID > > > > > > > > - Make a ragged array of DATE by ID > > > > > > > > - Remove IDs that only occur once. > > > > > > > > - Subtract the first and second DATEs. Remove IDs for > > which > > > > this = zero, as this will only be true for IDs for which the > > > > appointment is recorded twice (because there were two diagnoses > > > > recorded on this date). > > > > > > > > - (Then do the same to get the 'last appointment' > > > duplicates, > > > > by reversing the initial sort by DATE.) > > > > > > > > I am stuck at the 'Subtract dates' step: I would like to get the > > > > data out of the ragged array by columns (so e.g. I end up with a > > > > matrix of ID, 1st DATE, 2nd DATE). But I can't get the dates out > > > > by column from the ragged array. > > > > > > > > I hope someone can help. My ugly code is below, with some data > for > > > > testing. > > > > > > > > > > > > Stuart > > > > > > > > > > > > Dr Stuart John Leask DM FRCPsych MB BChir MA Clinical Senior > > > > Lecturer and Honorary Consultant Pychiatrist Institute of Mental > > > > Health, Innovation Park Triumph Road, Nottingham, Notts. NG7 2TU. > > UK > > > > Tel. +44 > > > > 115 82 30419 > > > > > stuart.le...@nottingham.ac.uk<mailto:stuart.le...@nottingham.ac.uk > > > > > > > > > Google 'Dr Stuart Leask' > > > > > > > > > > > > ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323 > > > > ,547,794,814,814,814,814,814,814,841,841,841,841,841 > > > > ,841,841,841,841,910,910,910,910,910,910,999,1019,1019 > > > > ,1019) > > > > > > > > DATE <- > > > > c(20060821,20061207,20080102,20090904,20040205,20040323,20051111 > > > > ,20060111,20071119,20080107,20080407,20080521,20080711,20041005 > > > > ,20070905,20020814,20021125,20040429,20040429,20071205,20080227 > > > > ,20050421,20060130,20060428,20060602,20060816,20061025,20061129 > > > > ,20070112,20070514,20091105,20091117,20091119,20091120,20091210 > > > > ,20091224,20050503,19870508,19880223,19880330) > > > > > > > > id.d <- cbind (ID,DATE ) > > > > rag.a <- split ( id.d [ ,2 ], id.d [ ,1]) # > create > > > > ragged array, 1-n DATES for every NAME > > > > > > > > # Inelegant attempt to remove IDs that only have one entry: > > > > > > > > rag.s <-tapply (id.d [ ,2], id.d [ ,1], sum) #add up > > the > > > > dates per row > > > > # Since DATE is in 'year mo da', if there's only one date, sum > > > > will > > > be > > > > less than 2100000: > > > > rag.t <- rag.s [ rag.s > 21000000 ] > > > > multi.dates <- rownames ( rag.t ) # all > the > > > IDs > > > > with >1 date > > > > rag.am <- rag.a [ multi.dates ] # > rag.am > > > only > > > > has IDs with > 1 Date > > > > > > > > > > > > # But now I'm stuck. > > > > # Each row of the array is rag.am$ID. > > > > # So I can't pick columns of DATEs from the ragged array. > > > > > > > > This message and any attachment are intended solely for the > > > > addressee and may contain confidential information. If you have > > > > received this message in error, please send it back to me, and > > > > immediately delete > > > it. > > > > Please do not use, copy or disclose the information contained in > > > > this message or in any attachment. Any views or opinions > > > > expressed by the author of this email do not necessarily reflect > > > > the views of the University of Nottingham. > > > > > > > > This message has been checked for viruses but the contents of an > > > > attachment may still contain software viruses which could damage > > > > your computer system: > > > > you are advised to perform your own checks. Email communications > > > > with the University of Nottingham may be monitored as permitted > by > > > > UK legislation. > > > > [[alternative HTML version deleted]] > > > > > > > > ______________________________________________ > > > > R-help@r-project.org mailing list > > > > https://stat.ethz.ch/mailman/listinfo/r-help > > > > PLEASE do read the posting guide http://www.R- > project.org/posting- > > > > guide.html and provide commented, minimal, self-contained, > > > > reproducible code. > > > > > > ______________________________________________ > > > R-help@r-project.org mailing list > > > https://stat.ethz.ch/mailman/listinfo/r-help > > > PLEASE do read the posting guide http://www.R-project.org/posting- > > > guide.html and provide commented, minimal, self-contained, > > > reproducible code. ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.