Hi,
One more thing: this version should get the dates which are duplicated. In my first reply, I was looking for the duplicated rows instead. Sorry for that!
id.d<-data.frame(ID,DATE)
new1<-id.d[duplicated(id.d$DATE)|duplicated(id.d$DATE,fromLast=TRUE),]
new2<-new1[order(new1$ID,new1$DATE),]
tapply(new2$ID,new2$DATE,head,1)
#19870508 20040205 20040429 20050421
#     910      167      814      841

But the result is still not what you wanted, because 910's date is the earliest date when compared to 1019.

new1[order(new1$ID,new1$DATE),]
#     ID     DATE
#5   167 20040205
#6   167 20040205
#18  814 20040429
#19  814 20040429
#22  841 20050421
#23  841 20050421
#31  910 19870508
#32  910 20040205
#33  910 20040205
#38 1019 19870508
#39 1019 19870508
A.K.

----- Original Message -----
From: Stuart Leask <stuart.le...@nottingham.ac.uk>
To: arun <smartpink...@yahoo.com>
Cc: Petr PIKAL <petr.pi...@precheza.cz>
Sent: Tuesday, October 23, 2012 9:15 AM
Subject: RE: [R] [r] How to pick colums from a ragged array?

Sorry Arun, but when I run it I get an error:

> ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
+ ,547,794,814,814,814,814,814,814,841,841,841,841,841
+ ,841,841,841,841,910,910,910,910,910,910,999,1019,1019
+ ,1019)
>
> DATE <-
+ c(20060821,20061207,20080102,20090904,20040205,20040205,20051111
+ ,20060111,20071119,20080107,20080407,20080521,20080711,20041005
+ ,20070905,20020814,20021125,20040429,20040429,20071205,20080227
+ ,20050421,20050421,20060428,20060602,20060816,20061025,20061129
+ ,20070112,20070514, 19870508,20040205,20040205, 20091120,20091210
+ ,20091224,20050503,19870508,19870508,19880330)
>
> id.d <- cbind (ID,DATE )
> new1<-id.d[duplicated(id.d)|duplicated(id.d,fromLast=TRUE),]
>
>
> tapply(new1$ID,new1$DATE,head,1)
Error in new1$DATE : $ operator is invalid for atomic vectors

-----Original Message-----
From: arun [mailto:smartpink...@yahoo.com]
Sent: 23 October 2012 14:05
To: Stuart Leask
Cc: R help; Petr PIKAL
Subject: Re: [R] [r] How to pick colums from a ragged array?

HI,
I was not following the thread.
May be this is what you are looking for:

new1<-id.d[duplicated(id.d)|duplicated(id.d,fromLast=TRUE),]
tapply(new1$ID,new1$DATE,head,1)
#19870508 20040205 20040429 20050421
#    1019      167      814      841
A.K.

----- Original Message -----
From: Stuart Leask <stuart.le...@nottingham.ac.uk>
To: PIKAL Petr <petr.pi...@precheza.cz>; "r-help@r-project.org" <r-help@r-project.org>
Cc:
Sent: Tuesday, October 23, 2012 8:28 AM
Subject: Re: [R] [r] How to pick colums from a ragged array?

Hi there. Not sure I follow what you are doing.

I want a list of all the IDs that have duplicate DATE entries, only when the DATE is the earliest (or last) date for that ID.

I have refined my test dataset, to include some tests (e.g. 910 has the same dup as 1019, but for 910 it's not the earliest date):

ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,999,1019,1019
,1019)

DATE <-
c(20060821,20061207,20080102,20090904,20040205,20040205,20051111
,20060111,20071119,20080107,20080407,20080521,20080711,20041005
,20070905,20020814,20021125,20040429,20040429,20071205,20080227
,20050421,20050421,20060428,20060602,20060816,20061025,20061129
,20070112,20070514, 19870508,20040205,20040205, 20091120,20091210
,20091224,20050503,19870508,19870508,19880330)

Correct output:
"167"  "841"  "1019"

Stuart

-----Original Message-----
From: PIKAL Petr [mailto:petr.pi...@precheza.cz]
Sent: 23 October 2012 13:15
To: Stuart Leask; r-help@r-project.org
Subject: RE: [r] How to pick colums from a ragged array?
Hi

Rui's answer brought me to more elaborated solution which still needs data frame to be ordered by date

fff<-function(data, first=TRUE, remove=FALSE) {
testfirst <- function(x) x[1,2]==x[2,2]
testlast <- function(x) x[length(x),2]==x[length(x)-1,2]
if(first) sel <- as.numeric(names(which(sapply(split(data, data[,1]), testfirst)))) else
sel <- as.numeric(names(which(sapply(split(data, data[,1]), testlast))))
if (remove) data[data[,1]!=sel,] else data[data[,1]==sel,]
}

> fff(id.d)
    ID     DATE
31 910 20091105
32 910 20091105
33 910 20091117
34 910 20091119
35 910 20091120
36 910 20091210
37 910 20091224
38 910 20091224

> fff(id.d, remove=T)
     ID     DATE
1    58 20060821
2    58 20061207
3    58 20080102
4    58 20090904
5   167 20040205
6   167 20040323
7   323 20051111
8   323 20060111
9   323 20071119
10  323 20080107
11  323 20080407
12  323 20080521
13  323 20080711
14  547 20041005
15  794 20070905
16  814 20020814
17  814 20021125
18  814 20040429
19  814 20040429
20  814 20071205
21  814 20080227
22  841 20050421
23  841 20060130
24  841 20060428
25  841 20060602
26  841 20060816
27  841 20061025
28  841 20061129
29  841 20070112
30  841 20070514
39  999 20050503
40 1019 19870508
41 1019 19880223
42 1019 19880330
43 1019 19880330
>

Regards
Petr

> -----Original Message-----
> From: r-help-boun...@r-project.org [mailto:r-help-bounces@r-project.org] On Behalf Of PIKAL Petr
> Sent: Tuesday, October 23, 2012 1:49 PM
> To: Stuart Leask; r-help@r-project.org
> Subject: Re: [R] [r] How to pick colums from a ragged array?
>
> Hi
>
> I did not check your code and rather followed your explanation. BTW, thanks for test data.
>
> small change in data frame to make DATE as Date class
>
> datum<-as.Date(as.character(DATE), format="%Y%m%d")
> id.d <- data.frame(ID,datum )
>
> ordering by date
>
> id.d<-id.d[order(id.d$datum),]
>
>
> two functions to test if first two dates are the same or last two dates are the same
>
> testfirst <- function(x) x[1,2]==x[2,2]
> testlast <- function(x) x[length(x),2]==x[length(x)-1,2]
>
> change one last date in the data frame to be the same as previous
>
> id.d[35,2]<-id.d[36,2]
>
> and here are results
>
> sapply(split(id.d, id.d$ID), testlast)
>    58   167   323   547   794   814   841   910   999  1019
> FALSE FALSE FALSE    NA    NA FALSE FALSE  TRUE    NA FALSE
>
> sapply(split(id.d, id.d$ID), testfirst)
>    58   167   323   547   794   814   841   910   999  1019
> FALSE FALSE FALSE    NA    NA FALSE FALSE FALSE    NA FALSE
>
> Now you can select ID which is true and remove it from your data
> which(sapply(split(id.d, id.d$ID), testlast))
>
> and use it for your data frame to subset/remove
> id.d$ID == as.numeric(names(which(sapply(split(id.d, id.d$ID), testlast))))
>  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
> [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
> [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE
> [37]  TRUE  TRUE  TRUE  TRUE
>
> However I am not sure if this is exactly what you want.
>
> Regards
> Petr
>
> > -----Original Message-----
> > From: r-help-boun...@r-project.org [mailto:r-help-bounces@r-project.org] On Behalf Of Stuart Leask
> > Sent: Tuesday, October 23, 2012 11:38 AM
> > To: r-help@r-project.org
> > Subject: [R] [r] How to pick colums from a ragged array?
> >
> > I have a large dataset (~1 million rows) of three variables: ID (patient's name), DATE (of appointment) and DIAGNOSIS (given on that date).
> > Patients may have been assigned more than one diagnosis at any one appointment - leading to two rows, same ID and DATE but different DIAGNOSIS.
> > The diagnoses may change between appointments.
> >
> > I want to subset the data in two ways:
> >
> > - define groups of patients by the first diagnosis given
> >
> > - define groups of patients by the last diagnosis given.
> >
> > The problem:
> > Unfortunately, a small number of patients have been given more than one diagnosis at their first (or last) appointment. These individuals I need to identify and remove, as it's not possible to say uniquely what their first (or last) diagnosis was. So I need to identify and remove these individuals which have pairs of rows with the same ID and (lowest or highest) DATE. The size of the dataset precludes the option of doing this by eye.
> >
> > I suspect there is a very elegant way of doing this in R.
> >
> > This is what I've come up with:
> >
> > - Sort by DATE then ID
> >
> > - Make a ragged array of DATE by ID
> >
> > - Remove IDs that only occur once.
> >
> > - Subtract the first and second DATEs. Remove IDs for which this = zero, as this will only be true for IDs for which the appointment is recorded twice (because there were two diagnoses recorded on this date).
> >
> > - (Then do the same to get the 'last appointment' duplicates, by reversing the initial sort by DATE.)
> >
> > I am stuck at the 'Subtract dates' step: I would like to get the data out of the ragged array by columns (so e.g. I end up with a matrix of ID, 1st DATE, 2nd DATE). But I can't get the dates out by column from the ragged array.
> >
> > I hope someone can help. My ugly code is below, with some data for testing.
> >
> >
> > Stuart
> >
> >
> > Dr Stuart John Leask DM FRCPsych MB BChir MA
> > Clinical Senior Lecturer and Honorary Consultant Psychiatrist
> > Institute of Mental Health, Innovation Park
> > Triumph Road, Nottingham, Notts. NG7 2TU. UK
> > Tel. +44 115 82 30419
> > stuart.le...@nottingham.ac.uk<mailto:stuart.le...@nottingham.ac.uk>
> > Google 'Dr Stuart Leask'
> >
> >
> > ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
> > ,547,794,814,814,814,814,814,814,841,841,841,841,841
> > ,841,841,841,841,910,910,910,910,910,910,999,1019,1019
> > ,1019)
> >
> > DATE <-
> > c(20060821,20061207,20080102,20090904,20040205,20040323,20051111
> > ,20060111,20071119,20080107,20080407,20080521,20080711,20041005
> > ,20070905,20020814,20021125,20040429,20040429,20071205,20080227
> > ,20050421,20060130,20060428,20060602,20060816,20061025,20061129
> > ,20070112,20070514,20091105,20091117,20091119,20091120,20091210
> > ,20091224,20050503,19870508,19880223,19880330)
> >
> > id.d <- cbind (ID,DATE )
> > rag.a <- split ( id.d [ ,2 ], id.d [ ,1]) # create ragged array, 1-n DATES for every NAME
> >
> > # Inelegant attempt to remove IDs that only have one entry:
> >
> > rag.s <-tapply (id.d [ ,2], id.d [ ,1], sum) #add up the dates per row
> > # Since DATE is in 'year mo da', if there's only one date, sum will be less than 21000000:
> > rag.t <- rag.s [ rag.s > 21000000 ]
> > multi.dates <- rownames ( rag.t ) # all the IDs with >1 date
> > rag.am <- rag.a [ multi.dates ] # rag.am only has IDs with > 1 Date
> >
> >
> > # But now I'm stuck.
> > # Each row of the array is rag.am$ID.
> > # So I can't pick columns of DATEs from the ragged array.
> >
> > This message and any attachment are intended solely for the addressee and may contain confidential information. If you have received this message in error, please send it back to me, and immediately delete it. Please do not use, copy or disclose the information contained in this message or in any attachment. Any views or opinions expressed by the author of this email do not necessarily reflect the views of the University of Nottingham.
> >
> > This message has been checked for viruses but the contents of an attachment may still contain software viruses which could damage your computer system: you are advised to perform your own checks. Email communications with the University of Nottingham may be monitored as permitted by UK legislation.
> >
> > [[alternative HTML version deleted]]
> >
> > ______________________________________________
> > R-help@r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
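For reference, the requirement Stuart states in the thread (IDs whose earliest, or latest, DATE occurs more than once for that same ID) can be sketched directly, without building a ragged array. This is only a minimal sketch and not code posted by anyone in the thread: `dup_extreme` is a hypothetical helper name, and it assumes DATE stays numeric in yyyymmdd form, so that min() and max() order the dates correctly. The data are the refined test set from Stuart's 8:28 AM message.

```r
# Refined test data from Stuart's message (23 October 2012, 8:28 AM)
ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323,
        547,794,814,814,814,814,814,814,841,841,841,841,841,
        841,841,841,841,910,910,910,910,910,910,999,1019,1019,
        1019)
DATE <- c(20060821,20061207,20080102,20090904,20040205,20040205,20051111,
          20060111,20071119,20080107,20080407,20080521,20080711,20041005,
          20070905,20020814,20021125,20040429,20040429,20071205,20080227,
          20050421,20050421,20060428,20060602,20060816,20061025,20061129,
          20070112,20070514,19870508,20040205,20040205,20091120,20091210,
          20091224,20050503,19870508,19870508,19880330)

# data.frame() rather than cbind(), so that $ works on the result
id.d <- data.frame(ID, DATE)

# Hypothetical helper (an assumption, not from the thread): returns the IDs
# whose earliest (first = TRUE) or latest (first = FALSE) DATE appears more
# than once within that ID.
dup_extreme <- function(d, first = TRUE) {
  pick <- if (first) min else max
  hit <- tapply(d$DATE, d$ID, function(x) sum(x == pick(x)) > 1)
  names(hit)[hit]
}

first_dups <- dup_extreme(id.d)                # "167" "841" "1019"
last_dups  <- dup_extreme(id.d, first = FALSE) # "167"

# Drop the patients whose first diagnosis is ambiguous
clean <- id.d[!(id.d$ID %in% as.numeric(first_dups)), ]
```

On this test set the first call gives the "Correct output" quoted in Stuart's message. The grouping does not depend on the rows being sorted. If you adapt the `testlast` helper quoted in the thread instead, note that length() on a data frame returns the number of columns, so nrow() is the safer way to reach the last row, and %in% is safer than == once more than one ID is selected.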