Re: [Rd] [.data.frame speedup

Tim Hesterberg Thu, 03 Jul 2008 17:05:42 -0700

I made a couple of changes from the previous version:
 - don't use functions anyMissing or notSorted (which aren't in base R)
 - don't check for dup.row.names attribute (need to modify other functions
   before that is useful)
I have not tested this with a wide variety of inputs; I'm assuming that
you have some regression tests.


Here are the file differences.  Let me know if you'd like a different
format.

$ diff -c dataframe.R dataframe2.R
*** dataframe.R    Thu Jul  3 15:48:12 2008
--- dataframe2.R    Thu Jul  3 16:36:46 2008
***************
*** 530,535 ****
--- 530,541 ----
      x <- .Call("R_copyDFattr", xx, x, PACKAGE="base")
      oldClass(x) <- attr(x, "row.names") <- NULL

+     # Do not want to check for duplicates if don't need to
+     noDuplicateRowNames <- (is.logical(i) ||
+                             length(i) < 2 ||
+                             (is.numeric(i) && min(i, 0, na.rm=TRUE) < 0) ||
+                             (!any(is.na(i)) && all(i[-length(i)]<i[-1])))
+
      if(!missing(j)) { # df[i, j]
          x <- x[j]
          cols <- names(x)  # needed for 'drop'
***************
*** 579,592 ****
          ## row names might have NAs.
          if(is.null(rows)) rows <- attr(xx, "row.names")
          rows <- rows[i]
!     if((ina <- any(is.na(rows))) | (dup <- any(duplicated(rows)))) {
!         ## both will coerce integer 'rows' to character:
!         if (!dup && is.character(rows)) dup <- "NA" %in% rows
!         if(ina)
!         rows[is.na(rows)] <- "NA"
!         if(dup)
!         rows <- make.unique(as.character(rows))
!     }
          ## new in 1.8.0  -- might have duplicate columns
          if(any(duplicated(nm <- names(x)))) names(x) <- make.unique(nm)
          if(is.null(rows)) rows <- attr(xx, "row.names")[i]
--- 585,594 ----
          ## row names might have NAs.
          if(is.null(rows)) rows <- attr(xx, "row.names")
          rows <- rows[i]
!         if(any(is.na(rows)))
!           rows[is.na(rows)] <- "NA" # coerces integer 'rows' to character
!         if(!noDuplicateRowNames && any(duplicated(rows)))
!           rows <- make.unique(as.character(rows)) # coerces integer 'rows' to character
          ## new in 1.8.0  -- might have duplicate columns
          if(any(duplicated(nm <- names(x)))) names(x) <- make.unique(nm)
          if(is.null(rows)) rows <- attr(xx, "row.names")[i]
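
To make the new condition easier to try out on its own, here is a minimal sketch that wraps it in a standalone function. The name canSkipDupCheck is hypothetical (it is not in base R or in the patch); the body is copied from the noDuplicateRowNames expression above.

```r
# Hypothetical standalone copy of the noDuplicateRowNames condition.
# Returns TRUE when the patch would skip the duplicated(rows) check.
canSkipDupCheck <- function(i) {
  is.logical(i) ||
    length(i) < 2 ||
    (is.numeric(i) && min(i, 0, na.rm = TRUE) < 0) ||
    (!any(is.na(i)) && all(i[-length(i)] < i[-1]))
}

canSkipDupCheck(c(TRUE, FALSE, TRUE))  # logical index
canSkipDupCheck(-(1:2))                # negative numeric index
canSkipDupCheck(1:3)                   # strictly increasing index
canSkipDupCheck(3:1)                   # decreasing: still checks duplicates
canSkipDupCheck(c(1, 3, 2, 4, 3))      # repeated value: still checks duplicates
```

The first three calls return TRUE and the last two return FALSE. A logical index that contains NA also returns TRUE, so that input is probably worth including in any regression test of the patch.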



Here's some code for testing, and timings

# Use:
# R --no-init-file --no-site-file

x <- data.frame(a=1:4, b=2:5)

# Run these commands with the default and new versions of [.data.frame
trace("duplicated")
trace("make.unique")
x[2:1]
x[1]
x[1:2]
x[1:3, ]                # save one call to duplicated(rows)
x[c(TRUE,FALSE,FALSE,TRUE), ]  # save one call to duplicated(rows)
x[-1,]                  # save one call to duplicated(rows)
x[-(1:2),]              # save one call to duplicated(rows)
x[3:1, ]
x[c(1,3,2,4,3), ]
untrace("duplicated")
untrace("make.unique")
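
The trace output only shows how often duplicated and make.unique are called. As a complement, the following is a minimal sketch of a result check that should pass with either version of [.data.frame: for each row index it confirms that the selected values match direct vector subscripting and that the resulting row names are unique. The list name cases is just for illustration.

```r
x <- data.frame(a = 1:4, b = 2:5)

# Row indices taken from the test commands above
cases <- list(1:3, c(TRUE, FALSE, FALSE, TRUE), -1, -(1:2), 3:1,
              c(1, 3, 2, 4, 3))

for (i in cases) {
  res <- x[i, ]
  # Values must agree with plain vector subscripting
  stopifnot(identical(res$a, x$a[i]), identical(res$b, x$b[i]))
  # Row names must stay unique, even when rows are repeated
  stopifnot(!any(duplicated(row.names(res))))
}
```

The last case repeats row 3, so it is the one that has to go through make.unique.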


# Timings
# Run one of these lines, then everything afterward
n <- 10^5
n <- 10^6
n <- 10^7

y <- data.frame(a=1:n, b=1:n)

i <- 1:n
system.time(temp <- y[i, ])
#       n       old     new
#       10^5    .128    .052
#       10^6    .237    .591
#       10^7    3.10    2.882

i <- rep(TRUE, n)
system.time(temp <- y[i, ])
#       n       old     new
#       10^5    .157    .053
#       10^6    .787    .449
#       10^7    3.799   2.138

i <- -1
system.time(temp <- y[i, ])
#       n       old     new
#       10^5    .157    .051
#       10^6    .614    .497
#       10^7    4.163   2.482

i <- rep(1:(n/2), 2) # expect no speedup for this case
system.time(temp <- y[i, ])
#       n       old     new
#       10^5    .559    .782
#       10^6    6.066   6.078

# Times shown are the user times reported by system.time

# The time savings are mostly substantial in the
# cases where I expect savings.

# I've noticed a lot of variability in results from system.time,
# so I don't view these as very accurate, and I don't worry
# much about the cases where the time appears worse.
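
One way to reduce the run-to-run variability would be to repeat each timing and report the median. The following is a minimal sketch under that assumption; medianUserTime is a hypothetical helper, not something used to produce the numbers above.

```r
# Hypothetical helper: evaluate an expression 'reps' times and
# return the median user time reported by system.time.
medianUserTime <- function(expr, reps = 5) {
  e <- substitute(expr)   # capture the expression unevaluated
  env <- parent.frame()   # evaluate it where the caller's variables live
  times <- vapply(seq_len(reps), function(k) {
    system.time(eval(e, env))[["user.self"]]
  }, numeric(1))
  median(times)
}

n <- 10^5
y <- data.frame(a = 1:n, b = 1:n)
i <- 1:n
medianUserTime(temp <- y[i, ])
```

The median is less sensitive than a single run to an occasional garbage collection or other background activity, though it does not remove the variability entirely.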


On Thu, Jul 3, 2008 at 1:08 PM, Martin Maechler <[EMAIL PROTECTED]>
wrote:

> >>>>> "TH" == Tim Hesterberg <[EMAIL PROTECTED]>
> >>>>>     on Tue, 1 Jul 2008 15:23:53 -0700 writes:
>
>    TH> There is a bug in the standard version of [.data.frame;
>    TH> it mixes up handling duplicates and NAs when subscripting rows.
>
>    TH> x <- data.frame(x=1:3, y=2:4, row.names=c("a","b","NA"))
>    TH> y <- x[c(2:3, NA),]
>    TH> y
>
>    TH> It creates a data frame with duplicate row names, which then won't print.
>
> and that's a bug, indeed
> ("introduced" to R version 2.5.0, when the [.data.frame  code was much
> optimized for speed, with quite some care), and I have committed
> a fix (and a regression test) to both R-devel and R-patched.
>
> Thanks a lot for the bug report, Tim!
>
> Now about your newly proposed code:
> I'm sorry to say that it looks so much different from the source
> code in
>      https://svn.r-project.org/R/trunk/src/library/base/R/dataframe.R
> that I don't think we would accept it as a substitute, easily.
>
> Could you try to provide a minimal patch against the source code
> and also a self-contained example that exhibits the speed gain
> you are aiming for ?
>
> Best regards,
> Martin Maechler, ETH Zurich
>
> [.........................]
>
>
>    TH> On Tue, Jul 1, 2008 at 11:20 AM, Tim Hesterberg <[EMAIL PROTECTED]> wrote:
>
>    >> Below is a version of [.data.frame that is faster
>    >> for subscripting rows of large data frames; it avoids calling
>    >> duplicated(rows) if there is no need to check for duplicate
>    >> row names, when:
>    >>  - i is logical
>    >>  - attr(x, "dup.row.names") is not NULL (S+ compatibility)
>    >>  - i is numeric and negative
>    >>  - i is strictly increasing
>


______________________________________________
[email protected] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
