Hi Rui,

I think our results now match, except in the INCLUDE column:
id.d[c(11:13,22:24,38:40),]
#     ID     DATE DG INCLUDE
#11  323 20080407  1    TRUE
#12  323 20080521  2   FALSE
#13  323 20080521  3    TRUE
#22  841 20050421  1    TRUE
#23  841 20050421  2   FALSE
#24  841 20060428  1    TRUE
#38 1019 19870508  2    TRUE
#39 1019 19870508  1   FALSE
#40 1019 19880330  1    TRUE

I thought all the rows with the above IDs would be FALSE (from my solution):

res4[c(11:13,22:24,38:40),]
#     ID     DATE DG INCLUDE
#11  323 20080407  1   FALSE
#12  323 20080521  2   FALSE
#13  323 20080521  3   FALSE
#22  841 20050421  1   FALSE
#23  841 20050421  2   FALSE
#24  841 20060428  1   FALSE
#38 1019 19870508  1   FALSE
#39 1019 19870508  2   FALSE
#40 1019 19880330  1   FALSE

A.K.

----- Original Message -----
From: Rui Barradas <ruipbarra...@sapo.pt>
To: Stuart Leask <stuart.le...@nottingham.ac.uk>
Cc: "arun (smartpink...@yahoo.com)" <smartpink...@yahoo.com>; PIKAL Petr <petr.pi...@precheza.cz>; r-help <r-help@r-project.org>
Sent: Wednesday, October 24, 2012 1:41 PM
Subject: Re: [r] How to pick colums from a ragged array?

Hello,

Using one of Arun's ideas from a few posts ago, this new function returns a
logical index into id.d of the rows that should be _removed_, hence rm1 and
rm2.

getRepLogical <- function(x, first = TRUE){
    # work on the first two rows of each ID, or on the last two
    fun <- if(first) head else tail
    # per ID: flag a DATE that repeats within those two rows
    dte <- tapply(x[,2], x[,1], FUN = function(x) duplicated(fun(x, 2)))
    len <- tapply(x[,2], x[,1], FUN = length)
    # pad each ID's flags with FALSE up to its full number of rows
    lst <- lapply(seq_along(dte), function(i)
        c(dte[[i]], rep(FALSE, if(len[[i]] > 2) len[[i]] - 2 else 0)))
    lst <- if(first) lst else lapply(lst, rev)
    i1 <- unlist(lst)
    # same again for DG, but flagging values that are not repeats
    dg <- tapply(x[,3], x[,1], FUN = function(x) !duplicated(fun(x, 2)))
    lst <- lapply(seq_along(dte), function(i)
        c(dg[[i]], rep(FALSE, if(len[[i]] > 2) len[[i]] - 2 else 0)))
    lst <- if(first) lst else lapply(lst, rev)
    i2 <- unlist(lst)
    i1 & i2
}

rm1 <- getRepLogical(id.d)
rm2 <- getRepLogical(id.d, first = FALSE)

id.d[rm1, ]
id.d[rm2, ]
id.d$INCLUDE <- !(rm1 | rm2)

Hope this helps,

Rui Barradas

On 24-10-2012 16:41, Stuart Leask wrote:
> (And, considering the real application, the functions ideally should
> probably output a variable INCLUDE, the same length as the original data,
> with TRUE and FALSE for whether or not that row should be included...)
>
> -----Original Message-----
> From: Leask Stuart
> Sent: 24 October 2012 16:25
> To: arun (smartpink...@yahoo.com); 'PIKAL Petr'; Rui Barradas
> (ruipbarra...@sapo.pt)
> Subject: RE: [r] How to pick colums from a ragged array?
>
> Arun, Petr, Rui, many thanks for your help, and the functions you have
> written.
>
> You'll recall I wanted to remove these first (or last) duplicates, because
> they represented instances where two different diagnoses (in this case,
> variable DG, value 1, 2, 3, 4 or 5) had been recorded on the same day - so I
> can't say which was 'first' (or 'last').
>
> Your functions have revealed something I wasn't expecting: In some cases, the
> diagnoses recorded on the duplicated DATEs are the same!
> This is a surprise to me, but probably reflects someone going to two
> different departments in a clinic, and both departments submit data.
> I have to say that crazy things like this are often a feature of real data,
> which I'm sure you've come across yourselves.
>
> Of course, I don't want to remove records in which I can determine an
> unambiguous 'first diagnosis'.
>
> You have all put in so much effort on my behalf, I'm ashamed to ask, but I
> wonder if any of the functions you've written could do this with a little
> more indexing and the 'duplicated' function?
> So the function should only exclude an ID if, having identified a first (or
> last) DATE duplicate, the DGs for these two dates are different.
>
> Test dataset:
>
> ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
> ,547,794,814,814,814,814,814,814,841,841,841,841,841
> ,841,841,841,841,910,910,910,910,910,910,999,1019,1019
> ,1019)
>
> DATE <-
> c(20060821,20061207,20080102,20090904,20040205,20040205,20051111
> ,20060111,20071119,20080107,20080407,20080521,20080521,20041005
> ,20070905,20020814,20021125,20040429,20040429,20071205,20071205
> ,20050421,20050421,20060428,20060602,20060816,20061025,20061129
> ,20070112,20070514, 19870508,20040205,20040205, 20080521,20080521
> ,20091224,20050503,19870508,19870508,19880330)
>
> DG<-
> c(1,2,1,1,4,4,3,2,3,2,1,2,3,2,1,2,2,2,2,2,2,1,2,1,1,1,1,1,1,4,3,3,3,4,3,2,2,2,1,1)
>
> id.d<-data.frame(ID,DATE,DG)
> id.d
>
> # Considering Rui's getRepeat function:
>
> g.r<-getRepeat(id.d)       # defaults to first = TRUE;
>                            # use getRepeat(id.d, first = FALSE) to get the last ones
> g.rr<-do.call(rbind, g.r)  # put the data into a matrix
>
> # I can remove the date duplicates with:
> g.rr[rep(!duplicated(g.rr)[(1:(dim(g.rr)[1]/2))*2],each=2),]
>
> I'm not sure how to add this to your suggestions, Arun & Petr...
>
>
> Stuart
>
>
> -----Original Message-----
> From: PIKAL Petr [mailto:petr.pi...@precheza.cz]
> Sent: 23 October 2012 15:24
> To: Stuart Leask
> Subject: RE: [r] How to pick colums from a ragged array?
>
> Hi
>
> I assumed that id.d is a data frame
>
> id.d <- data.frame (ID,DATE )
>
> and
>
> fff(id.d)
>
> works for me
>
> Petr
>
>
>> -----Original Message-----
>> From: Stuart Leask [mailto:stuart.le...@nottingham.ac.uk]
>> Sent: Tuesday, October 23, 2012 3:13 PM
>> To: PIKAL Petr
>> Subject: RE: [r] How to pick colums from a ragged array?
>>
>> Hi Petr.
>> I see what you mean it should do, but when I run it I get an error
>> (see below).
>> Stuart
>>
>>
>>> ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
>> + ,547,794,814,814,814,814,814,814,841,841,841,841,841
>> + ,841,841,841,841,910,910,910,910,910,910,999,1019,1019
>> + ,1019)
>>> DATE <-
>> + c(20060821,20061207,20080102,20090904,20040205,20040205,20051111
>> + ,20060111,20071119,20080107,20080407,20080521,20080711,20041005
>> + ,20070905,20020814,20021125,20040429,20040429,20071205,20080227
>> + ,20050421,20050421,20060428,20060602,20060816,20061025,20061129
>> + ,20070112,20070514, 19870508,20040205,20040205, 20091120,20091210
>> + ,20091224,20050503,19870508,19870508,19880330)
>>> id.d <- cbind (ID,DATE )
>>> fff<-function(data, first=TRUE, remove=FALSE) {
>> +
>> + testfirst <- function(x) x[1,2]==x[2,2]
>> + testlast <- function(x) x[nrow(x),2]==x[nrow(x)-1,2]
>> +
>> + if(first) sel <- as.numeric(names(which(unlist(sapply(split(data,
>> + data[,1]), testfirst))))) else sel <-
>> + as.numeric(names(which(unlist(sapply(split(data, data[,1]),
>> + testlast)))))
>> +
>> + if (remove) data[!data[,1] %in% sel,] else data[data[,1] %in% sel,]
>> + }
>>> fff(id.d)
>> Error in x[1, 2] : incorrect number of dimensions
>>
>> -----Original Message-----
>> From: PIKAL Petr [mailto:petr.pi...@precheza.cz]
>> Sent: 23 October 2012 13:51
>> To: Stuart Leask; r-help@r-project.org
>> Subject: RE: [r] How to pick colums from a ragged array?
>>
>> Hi
>>
>>> -----Original Message-----
>>> From: Stuart Leask [mailto:stuart.le...@nottingham.ac.uk]
>>> Sent: Tuesday, October 23, 2012 2:29 PM
>>> To: PIKAL Petr; r-help@r-project.org
>>> Subject: RE: [r] How to pick colums from a ragged array?
>>>
>>> Hi there.
>>>
>>> Not sure I follow what you are doing.
>>>
>>> I want a list of all the IDs that have duplicate DATE entries, only
>>> when the DATE is the earliest (or last) date for that ID.
>>
>> And that is what the function (with 3 small modifications) does
>>
>>
>> fff<-function(data, first=TRUE, remove=FALSE) {
>>
>> testfirst <- function(x) x[1,2]==x[2,2]
>> testlast <- function(x) x[nrow(x),2]==x[nrow(x)-1,2]
>>
>> if(first) sel <- as.numeric(names(which(unlist(sapply(split(data,
>> data[,1]), testfirst))))) else sel <-
>> as.numeric(names(which(unlist(sapply(split(data, data[,1]),
>> testlast)))))
>>
>> if (remove) data[!data[,1] %in% sel,] else data[data[,1] %in% sel,]
>> }
>>
>> See the result of your refined data
>>
>> fff(id.d)
>>      ID       DATE
>> 5   167 2004-02-05
>> 6   167 2004-02-05
>> 22  841 2005-04-21
>> 23  841 2005-04-21
>> 24  841 2006-04-28
>> 25  841 2006-06-02
>> 26  841 2006-08-16
>> 27  841 2006-10-25
>> 28  841 2006-11-29
>> 29  841 2007-01-12
>> 30  841 2007-05-14
>> 38 1019 1987-05-08
>> 39 1019 1987-05-08
>> 40 1019 1988-03-30
>>> fff(id.d, first=F)
>>      ID       DATE
>> 5   167 2004-02-05
>> 6   167 2004-02-05
>>> fff(id.d, remove=T)
>>      ID       DATE
>> 1    58 2006-08-21
>> 2    58 2006-12-07
>> 3    58 2008-01-02
>> 4    58 2009-09-04
>> 7   323 2005-11-11
>> 8   323 2006-01-11
>> 9   323 2007-11-19
>> 10  323 2008-01-07
>> 11  323 2008-04-07
>> 12  323 2008-05-21
>> 13  323 2008-07-11
>> 14  547 2004-10-05
>> 15  794 2007-09-05
>> 16  814 2002-08-14
>> 17  814 2002-11-25
>> 18  814 2004-04-29
>> 19  814 2004-04-29
>> 20  814 2007-12-05
>> 21  814 2008-02-27
>> 31  910 1987-05-08
>> 32  910 2004-02-05
>> 33  910 2004-02-05
>> 34  910 2009-11-20
>> 35  910 2009-12-10
>> 36  910 2009-12-24
>> 37  999 2005-05-03
>>
>> You can do surgery on fff function to
>> see what result comes from some
>> piece of the function e.g.
>>
>> sapply(split(id.d, id.d[,1]), testlast)
>>
>> Regards
>> Petr
>>
>>> I have refined my test dataset, to include some tests (e.g. 910 has
>>> the same dup as 1019, but for 910 it's not the earliest date):
>>>
>>>
>>> ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
>>> ,547,794,814,814,814,814,814,814,841,841,841,841,841
>>> ,841,841,841,841,910,910,910,910,910,910,999,1019,1019
>>> ,1019)
>>>
>>> DATE <-
>>> c(20060821,20061207,20080102,20090904,20040205,20040205,20051111
>>> ,20060111,20071119,20080107,20080407,20080521,20080711,20041005
>>> ,20070905,20020814,20021125,20040429,20040429,20071205,20080227
>>> ,20050421,20050421,20060428,20060602,20060816,20061025,20061129
>>> ,20070112,20070514, 19870508,20040205,20040205, 20091120,20091210
>>> ,20091224,20050503,19870508,19870508,19880330)
>>>
>>> Correct output:
>>> "167" "841" "1019"
>>>
>>> Stuart
>>>
>>> -----Original Message-----
>>> From: PIKAL Petr [mailto:petr.pi...@precheza.cz]
>>> Sent: 23 October 2012 13:15
>>> To: Stuart Leask; r-help@r-project.org
>>> Subject: RE: [r] How to pick colums from a ragged array?
>>>
>>> Hi
>>>
>>> Rui's answer brought me to a more elaborate solution, which still
>>> needs the data frame to be ordered by date
>>>
>>> fff<-function(data, first=TRUE, remove=FALSE) {
>>>
>>> testfirst <- function(x) x[1,2]==x[2,2]
>>> testlast <- function(x) x[length(x),2]==x[length(x)-1,2]
>>>
>>> if(first) sel <- as.numeric(names(which(sapply(split(data,
>>> data[,1]),
>>> testfirst)))) else sel <- as.numeric(names(which(sapply(split(data,
>>> data[,1]), testlast))))
>>>
>>> if (remove) data[data[,1]!=sel,] else data[data[,1]==sel,]
>>> }
>>>
>>>
>>>> fff(id.d)
>>>     ID     DATE
>>> 31 910 20091105
>>> 32 910 20091105
>>> 33 910 20091117
>>> 34 910 20091119
>>> 35 910 20091120
>>> 36 910 20091210
>>> 37 910 20091224
>>> 38 910 20091224
>>>
>>>> fff(id.d, remove=T)
>>>      ID     DATE
>>> 1    58 20060821
>>> 2    58 20061207
>>> 3    58 20080102
>>> 4    58 20090904
>>> 5   167 20040205
>>> 6   167 20040323
>>> 7   323 20051111
>>> 8   323 20060111
>>> 9   323 20071119
>>> 10  323 20080107
>>> 11  323 20080407
>>> 12  323 20080521
>>> 13  323 20080711
>>> 14  547 20041005
>>> 15  794 20070905
>>> 16  814 20020814
>>> 17  814 20021125
>>> 18  814 20040429
>>> 19  814 20040429
>>> 20  814 20071205
>>> 21  814 20080227
>>> 22  841 20050421
>>> 23  841 20060130
>>> 24  841 20060428
>>> 25  841 20060602
>>> 26  841 20060816
>>> 27  841 20061025
>>> 28  841 20061129
>>> 29  841 20070112
>>> 30  841 20070514
>>> 39  999 20050503
>>> 40 1019 19870508
>>> 41 1019 19880223
>>> 42 1019 19880330
>>> 43 1019 19880330
>>>
>>> Regards
>>> Petr
>>>
>>>
>>>> -----Original Message-----
>>>> From: r-help-boun...@r-project.org [mailto:r-help-bounces@r-project.org]
>>>> On Behalf Of PIKAL Petr
>>>> Sent: Tuesday, October 23, 2012 1:49 PM
>>>> To: Stuart Leask; r-help@r-project.org
>>>> Subject: Re: [R] [r] How to pick colums from a ragged array?
>>>>
>>>> Hi
>>>>
>>>> I did not check your code and rather followed your explanation. BTW,
>>>> thanks for test data.
>>>>
>>>> small change in data frame to make DATE as Date class
>>>>
>>>> datum<-as.Date(as.character(DATE), format="%Y%m%d")
>>>> id.d <- data.frame(ID,datum )
>>>>
>>>> ordering by date
>>>>
>>>> id.d<-id.d[order(id.d$datum),]
>>>>
>>>>
>>>> two functions to test if first two dates are the same or last two
>>>> dates are the same
>>>>
>>>> testfirst <- function(x) x[1,2]==x[2,2]
>>>> testlast <- function(x) x[length(x),2]==x[length(x)-1,2]
>>>>
>>>> change one last date in the data frame to be the same as previous
>>>>
>>>> id.d[35,2]<-id.d[36,2]
>>>>
>>>> and here are results
>>>>
>>>> sapply(split(id.d, id.d$ID), testlast)
>>>>    58   167   323   547   794   814   841   910   999  1019
>>>> FALSE FALSE FALSE    NA    NA FALSE FALSE  TRUE    NA FALSE
>>>>
>>>>> sapply(split(id.d, id.d$ID), testfirst)
>>>>    58   167   323   547   794   814   841   910   999  1019
>>>> FALSE FALSE FALSE    NA    NA FALSE FALSE FALSE    NA FALSE
>>>>
>>>> Now you can select ID which is true and remove it from your data
>>>> which(sapply(split(id.d, id.d$ID), testlast))
>>>>
>>>> and use it for your data frame to subset/remove
>>>> id.d$ID == as.numeric(names(which(sapply(split(id.d, id.d$ID), testlast))))
>>>>  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
>>>> [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
>>>> [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE
>>>> [37]  TRUE  TRUE  TRUE  TRUE
>>>>
>>>> However I am not sure if this is exactly what you want.
>>>>
>>>> Regards
>>>> Petr
>>>>
>>>>> -----Original Message-----
>>>>> From: r-help-boun...@r-project.org [mailto:r-help-bounces@r-project.org]
>>>>> On Behalf Of Stuart Leask
>>>>> Sent: Tuesday, October 23, 2012 11:38 AM
>>>>> To: r-help@r-project.org
>>>>> Subject: [R] [r] How to pick colums from a ragged array?
>>>>>
>>>>> I have a large dataset (~1 million rows) of three variables: ID
>>>>> (patient's name), DATE (of appointment) and DIAGNOSIS (given on that
>>>>> date).
>>>>> Patients may have been assigned more than one diagnosis at any one
>>>>> appointment - leading to two rows, same ID and DATE but
>>>>> different DIAGNOSIS.
>>>>> The diagnoses may change between appointments.
>>>>>
>>>>> I want to subset the data in two ways:
>>>>>
>>>>> - define groups of patients by the first diagnosis given
>>>>>
>>>>> - define groups of patients by the last diagnosis given.
>>>>>
>>>>> The problem:
>>>>> Unfortunately, a small number of patients have been given more
>>>>> than one diagnosis at their first (or last) appointment. These
>>>>> individuals I need to identify and remove, as it's not possible to
>>>>> say uniquely what their first (or last) diagnosis was. So I need
>>>>> to identify and remove these individuals which have pairs of
>>>>> rows with the same ID and (lowest or highest) DATE. The size of the
>>>>> dataset precludes the option of doing this by eye.
>>>>>
>>>>> I suspect there is a very elegant way of doing this in R.
>>>>>
>>>>> This is what I've come up with:
>>>>>
>>>>>
>>>>> - Sort by DATE then ID
>>>>>
>>>>> - Make a ragged array of DATE by ID
>>>>>
>>>>> - Remove IDs that only occur once.
>>>>>
>>>>> - Subtract the first and second DATEs. Remove IDs for which
>>>>> this = zero, as this will only be true for IDs for which the
>>>>> appointment is recorded twice (because there were two diagnoses
>>>>> recorded on this date).
>>>>>
>>>>> - (Then do the same to get the 'last appointment' duplicates,
>>>>> by reversing the initial sort by DATE.)
>>>>>
>>>>> I am stuck at the 'Subtract dates' step: I would like to get the
>>>>> data out of the ragged array by columns (so e.g. I end up with a
>>>>> matrix of ID, 1st DATE, 2nd DATE). But I can't get the dates out
>>>>> by column from the ragged array.
>>>>>
>>>>> I hope someone can help. My ugly code is below, with some data for
>>>>> testing.
>>>>>
>>>>>
>>>>> Stuart
>>>>>
>>>>>
>>>>> Dr Stuart John Leask DM FRCPsych MB BChir MA Clinical Senior
>>>>> Lecturer and Honorary Consultant Psychiatrist Institute of Mental
>>>>> Health, Innovation Park Triumph Road, Nottingham, Notts. NG7 2TU. UK
>>>>> Tel. +44 115 82 30419
>>>>> stuart.le...@nottingham.ac.uk<mailto:stuart.le...@nottingham.ac.uk>
>>>>> Google 'Dr Stuart Leask'
>>>>>
>>>>>
>>>>> ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
>>>>> ,547,794,814,814,814,814,814,814,841,841,841,841,841
>>>>> ,841,841,841,841,910,910,910,910,910,910,999,1019,1019
>>>>> ,1019)
>>>>>
>>>>> DATE <-
>>>>> c(20060821,20061207,20080102,20090904,20040205,20040323,20051111
>>>>> ,20060111,20071119,20080107,20080407,20080521,20080711,20041005
>>>>> ,20070905,20020814,20021125,20040429,20040429,20071205,20080227
>>>>> ,20050421,20060130,20060428,20060602,20060816,20061025,20061129
>>>>> ,20070112,20070514,20091105,20091117,20091119,20091120,20091210
>>>>> ,20091224,20050503,19870508,19880223,19880330)
>>>>>
>>>>> id.d <- cbind (ID,DATE )
>>>>> rag.a <- split ( id.d [ ,2 ], id.d [ ,1])   # create ragged array, 1-n DATES for every NAME
>>>>>
>>>>> # Inelegant attempt to remove IDs that only have one entry:
>>>>>
>>>>> rag.s <-tapply (id.d [ ,2], id.d [ ,1], sum)   # add up the dates per row
>>>>> # Since DATE is in 'year mo da', if there's only one date, sum will be
>>>>> # less than 21000000:
>>>>> rag.t <- rag.s [ rag.s > 21000000 ]
>>>>> multi.dates <- rownames ( rag.t )   # all the IDs with >1 date
>>>>> rag.am <- rag.a [ multi.dates ]     # rag.am only has IDs with > 1 Date
>>>>>
>>>>>
>>>>> # But now I'm stuck.
>>>>> # Each row of the array is rag.am$ID.
>>>>> # So I can't pick columns of DATEs from the ragged array.
>>>>>
>>>>> This message and any attachment are intended solely for the
>>>>> addressee and may contain confidential information.
>>>>> If you have received this message in error, please send it back to me,
>>>>> and immediately delete it.
>>>>> Please do not use, copy or disclose the information contained in
>>>>> this message or in any attachment. Any views or opinions
>>>>> expressed by the author of this email do not necessarily reflect
>>>>> the views of the University of Nottingham.
>>>>>
>>>>> This message has been checked for viruses but the contents of an
>>>>> attachment may still contain software viruses which could damage
>>>>> your computer system:
>>>>> you are advised to perform your own checks. Email communications
>>>>> with the University of Nottingham may be monitored as permitted by
>>>>> UK legislation.
>>>>> [[alternative HTML version deleted]]
>>>>>
>>>>> ______________________________________________
>>>>> R-help@r-project.org mailing list
>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>>>> and provide commented, minimal, self-contained, reproducible code.
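
For reference, the rule Stuart states in his 24 October message (exclude an ID only when its first or last DATE is duplicated and the DG values recorded on that date differ) can be sketched in base R as below. This is only a minimal sketch under stated assumptions, not one of the functions posted in this thread: the helper name 'ambiguous' and the variable 'bad.ids' are made up for illustration, and it assumes that the whole ID should be dropped (every row of that ID gets INCLUDE = FALSE), which is how Arun reads the requirement at the top of the thread. The data are the 24 October test dataset.

```r
# Test data as posted by Stuart on 24 October 2012 (the version with DG).
ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323,
        547,794,814,814,814,814,814,814,841,841,841,841,841,
        841,841,841,841,910,910,910,910,910,910,999,1019,1019,
        1019)
DATE <- c(20060821,20061207,20080102,20090904,20040205,20040205,20051111,
          20060111,20071119,20080107,20080407,20080521,20080521,20041005,
          20070905,20020814,20021125,20040429,20040429,20071205,20071205,
          20050421,20050421,20060428,20060602,20060816,20061025,20061129,
          20070112,20070514,19870508,20040205,20040205,20080521,20080521,
          20091224,20050503,19870508,19870508,19880330)
DG <- c(1,2,1,1,4,4,3,2,3,2,1,2,3,2,1,2,2,2,2,2,2,1,2,1,1,1,1,1,1,4,
        3,3,3,4,3,2,2,2,1,1)
id.d <- data.frame(ID, DATE, DG)

# Hypothetical helper (not from the thread): TRUE when the rows of one ID
# that fall on its earliest (or latest) DATE carry more than one distinct DG.
# It uses min()/max(), so the rows do not need to be sorted by DATE.
ambiguous <- function(d, first = TRUE) {
  edge <- if (first) min(d$DATE) else max(d$DATE)
  length(unique(d$DG[d$DATE == edge])) > 1
}

by.id     <- split(id.d, id.d$ID)
amb.first <- vapply(by.id, ambiguous, logical(1), first = TRUE)
amb.last  <- vapply(by.id, ambiguous, logical(1), first = FALSE)

# IDs whose first or last diagnosis cannot be determined uniquely
bad.ids <- names(by.id)[amb.first | amb.last]

# Assumption: every row of such an ID is excluded
id.d$INCLUDE <- !(as.character(id.d$ID) %in% bad.ids)

id.d[c(11:13, 22:24, 38:40), ]
```

On this test data bad.ids comes out as "323" "841" "1019", while 167 and 814 are kept because the DG values on their duplicated dates agree. Splitting a data frame with about a million rows by ID may be slow, so on the real data a grouped approach (for example ave() on DG within ID and DATE) might be preferable.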