Re: [R] get latest dates for different people in a dataset

Göran Broström Sun, 25 Jan 2015 12:40:57 -0800

See inline;

On 2015-01-25 20:27, William Dunlap wrote:

 >> dLatestVisit <- dSorted[!duplicated(dSorted$Name), ]
 >
 >I guess it is faster, but who knows?


You can find out by making a function that generates datasets of
various sizes and timing the suggested algorithms.  E.g.,
makeData <-
function(nPatients, aveVisitsPerPatient, uniqueNameDate = TRUE){
     nrow <- trunc(nPatients * aveVisitsPerPatient)
     patientNames <- paste0("P",seq_len(nPatients))
     possibleDates <- as.Date(16001:17000, origin=as.Date("1970-01-01"))
     possibleTemps <- seq(97, 103, by=0.1)
     data <- data.frame(Name=sample(patientNames, replace=TRUE, size=nrow),
                CheckInDate=sample(possibleDates, replace=TRUE, size=nrow),
                Temp=sample(possibleTemps, replace=TRUE, size=nrow))
     if (uniqueNameDate) {
         data <- data[!duplicated(data[, c("Name", "CheckInDate")]), ]
     }
     data
}
funs <- list(
     f1 = function(data) {
         do.call(rbind, lapply(split(data, data$Name), function(x)
             x[order(x$CheckInDate),][nrow(x),]))
     }, f2 = function (d)
     {
         isEndOfRun <- function(x) c(x[-1] != x[-length(x)], TRUE)
         dSorted <- d[order(d$Name, d$CheckInDate), ]
         dSorted[isEndOfRun(dSorted$Name), ]
     }, f3 = function (d)
     {
         # is the following how you did reverse sort on date (& fwd on name)?

Yes; in fact I do this all the time in my applications (survival analysis), where I have several records for each individual.


Göran

         #  Too bad that order's decreasing arg is not vectorized
         dSorted <- d[order(d$Name, -as.numeric(d$CheckInDate)), ]
         dSorted[!duplicated(dSorted$Name), ]
     }, f4 = function(dta)
     {
         dta %>% group_by(Name)  %>% filter(CheckInDate==max(CheckInDate))
     })

D <- makeData(nPatients=35000, aveVisitsPerPatient=3.7) # c. 129000 visits
library(dplyr)
Z <- lapply(funs, function(fun){
     time <- system.time( result <- fun(D) )
     list(time=time, result=result) })

sapply(Z, function(x)x$time)
#               f1   f2   f3   f4
#user.self  461.25 0.47 0.36 3.01
#sys.self     1.20 0.00 0.00 0.01
#elapsed    472.33 0.47 0.39 3.03
#user.child     NA   NA   NA   NA
#sys.child      NA   NA   NA   NA

# duplicated is a bit better than diff, dplyr rather slower, rbind much
# slower.

equivResults <- function(a, b) {
     # results have different classes and different orders, so only check
     # size and contents
     identical(dim(a),dim(b)) && all(a[order(a$Name),]==b[order(b$Name),])
}
sapply(Z[-1], function(x)equivResults(x$result, Z[[1]]$result))
#  f2   f3   f4
#TRUE TRUE TRUE
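
The timings above come from a single dataset size. A minimal sketch of
repeating the comparison at several sizes, restricted to the two fast
approaches, might look like the following. The names makeSmallData,
lastInRun and firstAfterReverseSort are hypothetical stand-ins for
makeData, f2 and f3, and the sizes are arbitrary.

```r
# Hypothetical, trimmed-down stand-in for makeData(): names and dates only.
makeSmallData <- function(nPatients, aveVisitsPerPatient = 3.7) {
    n <- trunc(nPatients * aveVisitsPerPatient)
    data.frame(
        Name = sample(paste0("P", seq_len(nPatients)), size = n, replace = TRUE),
        CheckInDate = as.Date(sample(16001:17000, size = n, replace = TRUE),
                              origin = "1970-01-01"),
        stringsAsFactors = FALSE)
}

# Same idea as f2: sort by Name then date, keep the last row of each run.
lastInRun <- function(d) {
    s <- d[order(d$Name, d$CheckInDate), ]
    s[c(s$Name[-1] != s$Name[-nrow(s)], TRUE), ]
}

# Same idea as f3: sort by Name then reversed date, keep the first row per Name.
firstAfterReverseSort <- function(d) {
    s <- d[order(d$Name, -as.numeric(d$CheckInDate)), ]
    s[!duplicated(s$Name), ]
}

set.seed(1)
sizes <- c(1000, 5000, 20000)   # arbitrary patient counts
timings <- sapply(sizes, function(n) {
    d <- makeSmallData(n)
    c(nPatients = n,
      lastInRun = system.time(lastInRun(d))[["elapsed"]],
      firstAfterReverseSort = system.time(firstAfterReverseSort(d))[["elapsed"]])
})
t(timings)   # one row per size; the numbers depend on the machine
```

Elapsed times at the smaller sizes may be too short to measure reliably,
so treat small differences with caution.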

Note that the various functions give different results if any patient comes
in twice on the same day.  f4 includes both visits in the output; the others
include either the first or the last (as ordered in the original file).
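
The difference on tied dates can be seen on a three-row example. This
is only a sketch: the data are made up, and the second approach uses
base R's ave() as a stand-in for the dplyr filter in f4 rather than
dplyr itself.

```r
# Made-up data: John has two visits on the same (latest) day.
d <- data.frame(Name = c("John", "John", "Mary"),
                CheckInDate = as.Date(c("2014-04-01", "2014-04-01", "2014-03-01")),
                Temp = c(97, 99, 98.1),
                stringsAsFactors = FALSE)

# Sort then deduplicate (as in f3): exactly one row per Name.
# order() leaves ties in their original order, so the first tied row is kept.
dSorted <- d[order(d$Name, -as.numeric(d$CheckInDate)), ]
oneRowPerName <- dSorted[!duplicated(dSorted$Name), ]

# Keep every row whose date equals the per-Name maximum.
# Base-R stand-in (an assumption) for filter(CheckInDate == max(CheckInDate)).
latest <- ave(as.numeric(d$CheckInDate), d$Name, FUN = max)
allTiedRows <- d[as.numeric(d$CheckInDate) == latest, ]

nrow(oneRowPerName)   # 2
nrow(allTiedRows)     # 3
```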

Bill Dunlap
TIBCO Software
wdunlap tibco.com

On Sun, Jan 25, 2015 at 1:01 AM, Göran Broström <goran.brost...@umu.se>
wrote:

    On 2015-01-24 01:14, William Dunlap wrote:

        Here is one way.  Sort the data.frame, first by Name then break
        ties with CheckInDate.  Then choose the rows that are the last in
        a run of identical Name values.


    I do it by sorting by the reverse order of CheckInDate (last date
    first) within Name, then

     > dLatestVisit <- dSorted[!duplicated(dSorted$Name), ]

    I guess it is faster, but who knows?

    Göran
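
A minimal sketch of the reverse-sort approach described above, applied
to the four-row example from the original question. The second variant
assumes R >= 3.3.0, where order() with method = "radix" accepts one
'decreasing' flag per sort key; that variant is an addition and is not
taken from the thread.

```r
d <- data.frame(Name = c("John", "Mary", "Sam", "John"),
                CheckInDate = as.Date(c("1/3/2014", "1/3/2014",
                                        "1/4/2014", "1/4/2014"),
                                      format = "%d/%m/%Y"),
                Temp = c(97, 98.1, 97.5, 99),
                stringsAsFactors = FALSE)

# Name ascending, date descending: negate the numeric form of the date.
dSorted <- d[order(d$Name, -as.numeric(d$CheckInDate)), ]
dLatestVisit <- dSorted[!duplicated(dSorted$Name), ]

# Assumption: R >= 3.3.0. With method = "radix", 'decreasing' may hold one
# flag per key, so the negation trick is not needed.
dSorted2 <- d[order(d$Name, d$CheckInDate,
                    decreasing = c(FALSE, TRUE), method = "radix"), ]
dLatestVisit2 <- dSorted2[!duplicated(dSorted2$Name), ]

dLatestVisit
```

Note that the radix method sorts character keys in C-locale order, which
can differ from the locale-aware default for names with mixed case or
accented letters.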


        txt <- "Name    CheckInDate      Temp
        John      1/3/2014              97
        Mary     1/3/2014              98.1
        Sam       1/4/2014              97.5
        John      1/4/2014              99"
        d <- read.table(header=TRUE,
            colClasses=c("character","character","numeric"), text=txt)
        d$CheckInDate <- as.Date(d$CheckInDate, format="%d/%m/%Y")
        isEndOfRun <- function(x) c(x[-1] != x[-length(x)], TRUE)
        dSorted <- d[order(d$Name, d$CheckInDate), ]
        dLatestVisit <- dSorted[isEndOfRun(dSorted$Name), ]
        dLatestVisit

          Name CheckInDate Temp
        4 John  2014-04-01 99.0
        2 Mary  2014-03-01 98.1
        3  Sam  2014-04-01 97.5
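
The run-end test used above can be checked on a small sorted vector. The
vector sortedNames is made up for illustration, and the comparison with
duplicated(fromLast = TRUE) is an added observation, not part of the
original code.

```r
isEndOfRun <- function(x) c(x[-1] != x[-length(x)], TRUE)

sortedNames <- c("John", "John", "Mary", "Sam", "Sam", "Sam")
isEndOfRun(sortedNames)
# FALSE  TRUE  TRUE FALSE FALSE  TRUE

# On sorted input, scanning for duplicates from the end marks the same rows.
sameRows <- identical(isEndOfRun(sortedNames),
                      !duplicated(sortedNames, fromLast = TRUE))
sameRows
```

The two tests agree only when equal names are adjacent, which is why the
data frame is sorted first.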


        Bill Dunlap
        TIBCO Software
        wdunlap tibco.com


        On Fri, Jan 23, 2015 at 3:43 PM, Tan, Richard <r...@panagora.com>
        wrote:

            Hi,

            Can someone help with an R question?

            I have a data set like:

            Name    CheckInDate      Temp
            John      1/3/2014              97
            Mary     1/3/2014              98.1
            Sam       1/4/2014              97.5
            John      1/4/2014              99

            I'd like to return a dataset that, for each Name, contains the
            row with the latest CheckInDate for that person.  For the
            example above it would be

            Name    CheckInDate      Temp
            John      1/4/2014              99
            Mary     1/3/2014              98.1
            Sam       1/4/2014              97.5


            Thank you for your help!

            Richard



            ________________________________________________
            R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
            https://stat.ethz.ch/mailman/listinfo/r-help
            PLEASE do read the posting guide
            http://www.R-project.org/posting-guide.html
            and provide commented, minimal, self-contained, reproducible
            code.






