Re: [R] [r] How to pick colums from a ragged array?

Rui Barradas Wed, 24 Oct 2012 10:50:39 -0700

Hello,

Using one of Arun's ideas from a few posts ago, this new function returns a logical index into id.d of the rows that should be _removed_, hence rm1 and rm2. I think




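getRepLogical <- function(x, first = TRUE){
    # Assumes the rows of x are grouped by ID, in increasing ID order.
    # Look at the first two rows of each ID, or the last two if first = FALSE.
    fun <- if(first) head else tail
    # Per ID: is the DATE repeated within that pair of rows?
    dte <- tapply(x[,2], x[,1], FUN = function(x) duplicated(fun(x, 2)))
    len <- tapply(x[,2], x[,1], FUN = length)
    # Pad each per-ID result with FALSE up to the number of rows of that ID.
    lst <- lapply(seq_along(dte), function(i)
        c(dte[[i]], rep(FALSE, if(len[[i]] > 2) len[[i]] - 2 else 0)))
    lst <- if(first) lst else lapply(lst, rev)
    i1 <- unlist(lst)
    # Per ID: do the DGs of that pair of rows differ?
    dg <- tapply(x[,3], x[,1], FUN = function(x) !duplicated(fun(x, 2)))
    lst <- lapply(seq_along(dte), function(i)
        c(dg[[i]], rep(FALSE, if(len[[i]] > 2) len[[i]] - 2 else 0)))
    lst <- if(first) lst else lapply(lst, rev)
    i2 <- unlist(lst)
    # TRUE where the DATE is repeated and the DG differs.
    i1 & i2
}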

rm1 <- getRepLogical(id.d)
rm2 <- getRepLogical(id.d, first = FALSE)

id.d[rm1, ]
id.d[rm2, ]

id.d$INCLUDE <- !(rm1 | rm2)


Hope this helps,

Rui Barradas
On 24-10-2012 16:41, Stuart Leask wrote:

(And, considering  the real application, the functions ideally should probably 
output a variable INCLUDE, the same length as the original data, with TRUE and 
FALSE for whether or not that row should be included...)

-----Original Message-----
From: Leask Stuart
Sent: 24 October 2012 16:25
To: arun (smartpink...@yahoo.com); 'PIKAL Petr'; Rui Barradas 
(ruipbarra...@sapo.pt)
Subject: RE: [r] How to pick colums from a ragged array?

Arun, Petr, Rui, many thanks for your help, and the functions you have written.

You'll recall I wanted to remove these first (or last) duplicates, because they 
represented instances where two different diagnoses (in this case, variable DG, 
value 1, 2, 3, 4 or 5) had been recorded on the same day - so I can't say which 
was 'first' (or 'last').

Your functions have revealed something I wasn't expecting: In some cases, the 
diagnoses recorded on the duplicated DATEs are the same!
This is a surprise to me, but probably reflects someone going to two different 
departments in a clinic, and both departments submit data. I have to say that 
crazy things like this are often a feature of real data, which I'm sure you've 
come across yourselves.

Of course, I don't want to remove records in which I can determine an 
unambiguous 'first diagnosis'.

You have all put in so much effort on my behalf, I'm ashamed to ask, but I 
wonder if any of the functions you've written could do this with a little more 
indexing and the 'duplicate' function.
So the function should only exclude an ID if, having identified a first (or 
last) DATE duplicate, the DGs for these two dates are different.
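
As a rough illustration of that rule (a sketch only, not one of the functions posted in this thread): excludeFirst is a hypothetical helper name, the toy data are made up, and the rows are assumed to be sorted by ID and then by DATE.

```r
# Toy data, sorted by ID then DATE (made up for illustration).
toy <- data.frame(ID   = c(1, 1, 1, 2, 2, 3),
                  DATE = c(20040205, 20040205, 20050101,
                           20040205, 20040205, 20060101),
                  DG   = c(4, 4, 2, 1, 3, 2))

# Hypothetical helper: TRUE for an ID whose first two rows share a DATE
# but carry different DGs, i.e. the first diagnosis is ambiguous.
excludeFirst <- function(d) {
    nrow(d) > 1 && d$DATE[1] == d$DATE[2] && d$DG[1] != d$DG[2]
}

bad <- vapply(split(toy, toy$ID), excludeFirst, logical(1))
bad.ids <- names(bad)[bad]

# Row-wise flag, the same length as the data.
toy$INCLUDE <- !(toy$ID %in% as.numeric(bad.ids))
toy
```

ID 1 has a repeated first DATE but the same DG on both rows, so it is kept; ID 2 has a repeated first DATE with different DGs, so only its rows get INCLUDE = FALSE.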

Test dataset:

ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,999,1019,1019
,1019)

DATE <-
  c(20060821,20061207,20080102,20090904,20040205,20040205,20051111
  ,20060111,20071119,20080107,20080407,20080521,20080521,20041005
  ,20070905,20020814,20021125,20040429,20040429,20071205,20071205
  ,20050421,20050421,20060428,20060602,20060816,20061025,20061129
  ,20070112,20070514, 19870508,20040205,20040205, 20080521,20080521
  ,20091224,20050503,19870508,19870508,19880330)

DG<-
c(1,2,1,1,4,4,3,2,3,2,1,2,3,2,1,2,2,2,2,2,2,1,2,1,1,1,1,1,1,4,3,3,3,4,3,2,2,2,1,1)

id.d<-data.frame(ID,DATE,DG)
id.d

# Considering Rui's getRepeat function:

g.r<-getRepeat(id.d)    # defaults to first = TRUE; use getRepeat(id.d, first = FALSE) to get the last ones
g.rr<-do.call(rbind, g.r) # put the data into a matrix

# I can remove the date duplicates with:
g.rr[rep(!duplicated(g.rr)[(1:(dim(g.rr)[1]/2))*2],each=2),]

I'm not sure how to add this to your suggestions, Arun & Petr...

Stuart

-----Original Message-----
From: PIKAL Petr [mailto:petr.pi...@precheza.cz]
Sent: 23 October 2012 15:24
To: Stuart Leask
Subject: RE: [r] How to pick colums from a ragged array?

Hi

I assumed that id.d is a data frame

id.d <- data.frame (ID,DATE )

and

fff(id.d)

works for me
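
The difference between cbind() and data.frame() probably explains the error quoted below. A small sketch with made-up values (m, d, pm and pd are illustrative names only):

```r
ID   <- c(58, 58, 167, 167)
DATE <- c(20060821, 20061207, 20040205, 20040205)

m <- cbind(ID, DATE)        # a numeric matrix
d <- data.frame(ID, DATE)   # a data frame

# split() on a matrix treats it as a plain vector, so each piece
# is a dimensionless vector and x[1, 2] cannot work on it.
pm <- split(m, m[, 1])
is.null(dim(pm[[1]]))

# split() on a data frame keeps whole rows, so each piece is a
# small data frame with both columns.
pd <- split(d, d[, 1])
dim(pd[[1]])
```

So testfirst and testlast, which index with x[1,2], need the data frame version of id.d.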

Petr

-----Original Message-----
From: Stuart Leask [mailto:stuart.le...@nottingham.ac.uk]
Sent: Tuesday, October 23, 2012 3:13 PM
To: PIKAL Petr
Subject: RE: [r] How to pick colums from a ragged array?

Hi Petr.
I see what you mean it should do, but when I run it I get an error
(see below).
Stuart

ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,999,1019,1019
,1019)

DATE <-
  c(20060821,20061207,20080102,20090904,20040205,20040205,20051111
  ,20060111,20071119,20080107,20080407,20080521,20080711,20041005
  ,20070905,20020814,20021125,20040429,20040429,20071205,20080227
  ,20050421,20050421,20060428,20060602,20060816,20061025,20061129
  ,20070112,20070514, 19870508,20040205,20040205, 20091120,20091210
  ,20091224,20050503,19870508,19870508,19880330)

id.d <- cbind (ID,DATE )

fff<-function(data, first=TRUE, remove=FALSE) {

testfirst <- function(x) x[1,2]==x[2,2]
testlast <- function(x) x[nrow(x),2]==x[nrow(x)-1,2]

if(first) sel <- as.numeric(names(which(unlist(sapply(split(data, data[,1]), testfirst)))))
else sel <- as.numeric(names(which(unlist(sapply(split(data, data[,1]), testlast)))))

if (remove) data[!data[,1] %in% sel,] else data[data[,1] %in% sel,]
}

fff(id.d)

Error in x[1, 2] : incorrect number of dimensions
-----Original Message-----
From: PIKAL Petr [mailto:petr.pi...@precheza.cz]
Sent: 23 October 2012 13:51
To: Stuart Leask; r-help@r-project.org
Subject: RE: [r] How to pick colums from a ragged array?

Hi

-----Original Message-----
From: Stuart Leask [mailto:stuart.le...@nottingham.ac.uk]
Sent: Tuesday, October 23, 2012 2:29 PM
To: PIKAL Petr; r-help@r-project.org
Subject: RE: [r] How to pick colums from a ragged array?

Hi there.

Not sure I follow what you are doing.

I want a list of all the IDs that have duplicate DATE entries, only
when the DATE is the earliest (or last) date for that ID.

And that is what the function (with 3 small modifications) does


fff<-function(data, first=TRUE, remove=FALSE) {

testfirst <- function(x) x[1,2]==x[2,2]
testlast <- function(x) x[nrow(x),2]==x[nrow(x)-1,2]

if(first) sel <- as.numeric(names(which(unlist(sapply(split(data, data[,1]), testfirst)))))
else sel <- as.numeric(names(which(unlist(sapply(split(data, data[,1]), testlast)))))

if (remove) data[!data[,1] %in% sel,] else data[data[,1] %in% sel,]
}
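
The three modifications are not spelled out in the message. Comparing with the earlier version quoted further down, they appear to be nrow(x) in place of length(x), the added unlist(), and %in% in place of == and !=. A small sketch, with made-up vectors, of why the last one matters once more than one ID is selected:

```r
id  <- c(58, 167, 167, 841, 841, 1019)
sel <- c(167, 841, 1019)

# '==' recycles sel along id and compares position by position,
# so matches depend on where the values happen to line up.
eq <- id == sel
eq

# %in% asks, for each element of id, whether it occurs anywhere in sel.
isin <- id %in% sel
isin
```

With == only the last two rows match here; with %in% every row belonging to a selected ID is picked up.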

See the result of your refined data

fff(id.d)
      ID       DATE
5   167 2004-02-05
6   167 2004-02-05
22  841 2005-04-21
23  841 2005-04-21
24  841 2006-04-28
25  841 2006-06-02
26  841 2006-08-16
27  841 2006-10-25
28  841 2006-11-29
29  841 2007-01-12
30  841 2007-05-14
38 1019 1987-05-08
39 1019 1987-05-08
40 1019 1988-03-30

fff(id.d, first=F)

    ID       DATE
5 167 2004-02-05
6 167 2004-02-05

fff(id.d, remove=T)

     ID       DATE
1   58 2006-08-21
2   58 2006-12-07
3   58 2008-01-02
4   58 2009-09-04
7  323 2005-11-11
8  323 2006-01-11
9  323 2007-11-19
10 323 2008-01-07
11 323 2008-04-07
12 323 2008-05-21
13 323 2008-07-11
14 547 2004-10-05
15 794 2007-09-05
16 814 2002-08-14
17 814 2002-11-25
18 814 2004-04-29
19 814 2004-04-29
20 814 2007-12-05
21 814 2008-02-27
31 910 1987-05-08
32 910 2004-02-05
33 910 2004-02-05
34 910 2009-11-20
35 910 2009-12-10
36 910 2009-12-24
37 999 2005-05-03
You can do surgery on fff function to see what result comes from some
piece of the function e.g.

sapply(split(id.d, id.d[,1]), testlast)

Regards
Petr

I have refined my test dataset, to include some tests (e.g. 910 has
the same dup as 1019, but for 910 it's not the earliest date):


ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,999,1019,1019
,1019)

DATE <-
  c(20060821,20061207,20080102,20090904,20040205,20040205,20051111
  ,20060111,20071119,20080107,20080407,20080521,20080711,20041005
  ,20070905,20020814,20021125,20040429,20040429,20071205,20080227
  ,20050421,20050421,20060428,20060602,20060816,20061025,20061129
  ,20070112,20070514, 19870508,20040205,20040205, 20091120,20091210
  ,20091224,20050503,19870508,19870508,19880330)

Correct output:
"167"  "841"  "1019"

Stuart

-----Original Message-----
From: PIKAL Petr [mailto:petr.pi...@precheza.cz]
Sent: 23 October 2012 13:15
To: Stuart Leask; r-help@r-project.org
Subject: RE: [r] How to pick colums from a ragged array?

Hi

Rui's answer brought me to more elaborated solution which still
needs data frame to be ordered by date

fff<-function(data, first=TRUE, remove=FALSE) {

testfirst <- function(x) x[1,2]==x[2,2]
testlast <- function(x) x[length(x),2]==x[length(x)-1,2]

if(first) sel <- as.numeric(names(which(sapply(split(data, data[,1]), testfirst))))
else sel <- as.numeric(names(which(sapply(split(data, data[,1]), testlast))))

if (remove) data[data[,1]!=sel,] else data[data[,1]==sel,]
}

fff(id.d)

     ID     DATE
31 910 20091105
32 910 20091105
33 910 20091117
34 910 20091119
35 910 20091120
36 910 20091210
37 910 20091224
38 910 20091224

fff(id.d, remove=T)

      ID     DATE
1    58 20060821
2    58 20061207
3    58 20080102
4    58 20090904
5   167 20040205
6   167 20040323
7   323 20051111
8   323 20060111
9   323 20071119
10  323 20080107
11  323 20080407
12  323 20080521
13  323 20080711
14  547 20041005
15  794 20070905
16  814 20020814
17  814 20021125
18  814 20040429
19  814 20040429
20  814 20071205
21  814 20080227
22  841 20050421
23  841 20060130
24  841 20060428
25  841 20060602
26  841 20060816
27  841 20061025
28  841 20061129
29  841 20070112
30  841 20070514
39  999 20050503
40 1019 19870508
41 1019 19880223
42 1019 19880330
43 1019 19880330
Regards
Petr

-----Original Message-----
From: r-help-boun...@r-project.org [mailto:r-help-bounces@r-project.org] On Behalf Of PIKAL Petr
Sent: Tuesday, October 23, 2012 1:49 PM
To: Stuart Leask; r-help@r-project.org
Subject: Re: [R] [r] How to pick colums from a ragged array?

Hi

I did not check your code and rather followed your explanation.

BTW, thanks for the test data.

A small change to the data frame to make DATE a Date class:

datum<-as.Date(as.character(DATE), format="%Y%m%d")
id.d <- data.frame(ID,datum )

ordering by date

id.d<-id.d[order(id.d$datum),]


two functions to test if first two dates are the same or last two
dates are the same

testfirst <- function(x) x[1,2]==x[2,2]
testlast <- function(x) x[length(x),2]==x[length(x)-1,2]
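
One thing to watch in testlast as written here: for a data frame, length() counts columns, not rows. A small sketch with a made-up two-column data frame (d is an illustrative name only):

```r
d <- data.frame(ID = c(910, 910, 910, 910),
                datum = as.Date(c("2009-11-05", "2009-11-20",
                                  "2009-12-24", "2009-12-24")))

n.col <- length(d)   # number of columns
n.row <- nrow(d)     # number of rows

# With two columns, d[length(d), 2] is the second row, not the last one.
by.length <- d[length(d), 2] == d[length(d) - 1, 2]   # compares rows 2 and 1
by.nrow   <- d[nrow(d), 2]   == d[nrow(d) - 1, 2]     # compares rows 4 and 3
c(by.length, by.nrow)
```

Only the nrow() form looks at the last two rows, which is where the repeated date sits in this toy example.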

change one last date in the data frame to be the same as previous

id.d[35,2]<-id.d[36,2]

and here are results

sapply(split(id.d, id.d$ID), testlast)
    58   167   323   547   794   814   841   910   999  1019
FALSE FALSE FALSE    NA    NA FALSE FALSE  TRUE    NA FALSE

sapply(split(id.d, id.d$ID), testfirst)

    58   167   323   547   794   814   841   910   999  1019
FALSE FALSE FALSE    NA    NA FALSE FALSE FALSE    NA FALSE

Now you can select the ID which is TRUE and remove it from your data

which(sapply(split(id.d, id.d$ID), testlast))

and use it for your data frame to subset/remove

id.d$ID == as.numeric(names(which(sapply(split(id.d, id.d$ID), testlast))))

 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE
[37]  TRUE  TRUE  TRUE  TRUE

However I am not sure if this is exactly what you want.

Regards
Petr

-----Original Message-----
From: r-help-boun...@r-project.org [mailto:r-help-bounces@r-project.org] On Behalf Of Stuart Leask
Sent: Tuesday, October 23, 2012 11:38 AM
To: r-help@r-project.org
Subject: [R] [r] How to pick colums from a ragged array?

I have a large dataset (~1 million rows) of three variables: ID
(patient's name), DATE (of appointment) and DIAGNOSIS (given on that
date).
Patients may have been assigned more than one diagnosis at any one
appointment - leading to two rows, same ID and DATE but different
DIAGNOSIS.
The diagnoses may change between appointments.

I want to subset the data in two ways:

-          define groups of patients by the first diagnosis given

-          define groups of patients by the last diagnosis given.

The problem:
Unfortunately, a small number of patients have been given more than
one diagnosis at their first (or last) appointment. These individuals
I need to identify and remove, as it's not possible to say uniquely
what their first (or last) diagnosis was. So I need to identify and
remove these individuals which have pairs of rows with the same ID and
(lowest or highest) DATE. The size of the dataset precludes the option
of doing this by eye.

I suspect there is a very elegant way of doing this in R.

This is what I've come up with:


-          Sort by DATE then ID

-          Make a ragged array of DATE by ID

-          Remove IDs that only occur once.

-          Subtract the first and second DATEs. Remove IDs for which
this = zero, as this will only be true for IDs for which the
appointment is recorded twice (because there were two diagnoses
recorded on this date).

-          (Then do the same to get the 'last appointment' duplicates,
by reversing the initial sort by DATE.)
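
The steps above can be sketched roughly as follows. This is only an illustration with made-up data, not the code from this thread; it assumes the sort is meant as DATE within ID, and first.dup and last.dup are illustrative names.

```r
# Made-up data: ID 2 has a repeated first DATE, ID 3 a repeated last DATE.
ID   <- c(1, 1, 2, 2, 2, 3, 3, 3, 4)
DATE <- c(20060821, 20061207, 20040205, 20040205, 20050101,
          20020814, 20040429, 20040429, 20041005)

# Sort by ID, then DATE within ID.
o <- order(ID, DATE)
ID <- ID[o]; DATE <- DATE[o]

# Ragged array of DATEs by ID, keeping only IDs with more than one DATE.
rag <- split(DATE, ID)
rag <- rag[lengths(rag) > 1]

# First minus second DATE is zero when the first appointment is repeated;
# the same test on the reversed dates finds repeated last appointments.
first.dup <- names(rag)[sapply(rag, function(x) x[1] - x[2] == 0)]
last.dup  <- names(rag)[sapply(rag, function(x) rev(x)[1] - rev(x)[2] == 0)]

first.dup
last.dup
```

Reversing each vector with rev() stands in for re-sorting the whole dataset in the last step.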

I am stuck at the 'Subtract dates' step: I would like to get the
data out of the ragged array by columns (so e.g. I end up with a
matrix of ID, 1st DATE, 2nd DATE). But I can't get the dates out
by column from the ragged array.
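
One way to pull "columns" out of such a ragged array, sketched on a small made-up list (rag, first, second and out are illustrative names only):

```r
rag <- list(`58`  = c(20060821, 20061207, 20080102),
            `167` = c(20040205, 20040205),
            `547` = 20041005)

# "[" is itself a function, so sapply(rag, "[", k) picks the k-th
# element of every vector; IDs with fewer than k dates get NA.
first  <- sapply(rag, "[", 1)
second <- sapply(rag, "[", 2)

# Matrix of ID, 1st DATE, 2nd DATE.
out <- cbind(ID = as.numeric(names(rag)), first, second)
out
```

IDs with a single appointment show up with NA in the second column, so they can be dropped with !is.na(second) rather than by summing dates.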

I hope someone can help. My ugly code is below, with some data for
testing.


Stuart


Dr Stuart John Leask DM FRCPsych MB BChir MA
Clinical Senior Lecturer and Honorary Consultant Psychiatrist
Institute of Mental Health, Innovation Park
Triumph Road, Nottingham, Notts. NG7 2TU. UK
Tel. +44 115 82 30419
stuart.le...@nottingham.ac.uk
Google 'Dr Stuart Leask'


ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,999,1019,1019
,1019)

DATE <-
c(20060821,20061207,20080102,20090904,20040205,20040323,20051111
,20060111,20071119,20080107,20080407,20080521,20080711,20041005
,20070905,20020814,20021125,20040429,20040429,20071205,20080227
,20050421,20060130,20060428,20060602,20060816,20061025,20061129
,20070112,20070514,20091105,20091117,20091119,20091120,20091210
,20091224,20050503,19870508,19880223,19880330)

id.d <- cbind (ID,DATE )
rag.a  <-  split ( id.d [ ,2 ], id.d [ ,1])               # create ragged array, 1-n DATES for every NAME

# Inelegant attempt to remove IDs that only have one entry:

rag.s <-tapply  (id.d [ ,2], id.d [ ,1], sum)             # add up the dates per row
# Since DATE is in 'year mo da', if there's only one date, sum will be less than 21000000:
rag.t <- rag.s [ rag.s > 21000000 ]
multi.dates <- rownames ( rag.t )                         # all the IDs with >1 date
rag.am <- rag.a [ multi.dates ]                           # rag.am only has IDs with > 1 Date


# But now I'm stuck.
# Each row of the array is rag.am$ID.
# So I can't pick columns of DATEs from the ragged array.


______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
