I made a couple of changes from the previous version:
- don't use functions anyMissing or notSorted (which aren't in base R)
- don't check for dup.row.names attribute (need to modify other functions
before that is useful)
I have not tested this with a wide variety of inputs; I'm assuming that
you have some regression tests.
Here are the file differences. Let me know if you'd like a different
format.
$ diff -c dataframe.R dataframe2.R
*** dataframe.R Thu Jul 3 15:48:12 2008
--- dataframe2.R Thu Jul 3 16:36:46 2008
***************
*** 530,535 ****
--- 530,541 ----
x <- .Call("R_copyDFattr", xx, x, PACKAGE="base")
oldClass(x) <- attr(x, "row.names") <- NULL
+ # Skip the duplicate check when the subscript cannot produce duplicates
+ noDuplicateRowNames <- (is.logical(i) ||
+ length(i) < 2 ||
+ (is.numeric(i) && min(i, 0, na.rm=TRUE) < 0) ||
+ (!any(is.na(i)) && all(i[-length(i)]<i[-1])))
+
if(!missing(j)) { # df[i, j]
x <- x[j]
cols <- names(x) # needed for 'drop'
***************
*** 579,592 ****
## row names might have NAs.
if(is.null(rows)) rows <- attr(xx, "row.names")
rows <- rows[i]
! if((ina <- any(is.na(rows))) | (dup <- any(duplicated(rows)))) {
! ## both will coerce integer 'rows' to character:
! if (!dup && is.character(rows)) dup <- "NA" %in% rows
! if(ina)
! rows[is.na(rows)] <- "NA"
! if(dup)
! rows <- make.unique(as.character(rows))
! }
## new in 1.8.0 -- might have duplicate columns
if(any(duplicated(nm <- names(x)))) names(x) <- make.unique(nm)
if(is.null(rows)) rows <- attr(xx, "row.names")[i]
--- 585,594 ----
## row names might have NAs.
if(is.null(rows)) rows <- attr(xx, "row.names")
rows <- rows[i]
! if(any(is.na(rows)))
! rows[is.na(rows)] <- "NA" # coerces integer 'rows' to character
! if(!noDuplicateRowNames && any(duplicated(rows)))
! rows <- make.unique(as.character(rows)) # coerces integer 'rows' to character
## new in 1.8.0 -- might have duplicate columns
if(any(duplicated(nm <- names(x)))) names(x) <- make.unique(nm)
if(is.null(rows)) rows <- attr(xx, "row.names")[i]
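To make the new condition easier to exercise on its own, here is a minimal sketch of the same test as a standalone function. The name canSkipDupCheck is hypothetical and not part of the patch, and the sketch assumes i is a non-missing row subscript.

```r
# Hypothetical helper (not part of the patch): TRUE when the row
# subscript 'i' is of a kind that the patch treats as unable to
# produce duplicate row names, so duplicated(rows) can be skipped.
canSkipDupCheck <- function(i) {
    is.logical(i) ||                                    # logical subscript
    length(i) < 2 ||                                    # zero or one element
    (is.numeric(i) && min(i, 0, na.rm = TRUE) < 0) ||   # negative subscripts drop rows
    (!any(is.na(i)) && all(i[-length(i)] < i[-1]))      # strictly increasing, no NA
}

canSkipDupCheck(c(TRUE, FALSE, FALSE, TRUE))  # logical
canSkipDupCheck(-(1:2))                       # negative
canSkipDupCheck(1:3)                          # strictly increasing
canSkipDupCheck(c(1, 3, 2, 4, 3))             # repeated index: must still check
```

One case that may deserve a separate look: a logical i containing NA selects NA rows, possibly more than once, so the is.logical(i) shortcut might need an additional !any(is.na(i)) guard.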
Here is some code for testing, followed by timings:
# Use:
# R --no-init-file --no-site-file
x <- data.frame(a=1:4, b=2:5)
# Run these commands with the default and new versions of [.data.frame
trace("duplicated")
trace("make.unique")
x[2:1]
x[1]
x[1:2]
x[1:3, ] # save one call to duplicated(rows)
x[c(TRUE,FALSE,FALSE,TRUE), ] # save one call to duplicated(rows)
x[-1,] # save one call to duplicated(rows)
x[-(1:2),] # save one call to duplicated(rows)
x[3:1, ]
x[c(1,3,2,4,3), ]
untrace("duplicated")
untrace("make.unique")
# Timings
# Run one of these lines, then everything afterward
n <- 10^5
n <- 10^6
n <- 10^7
y <- data.frame(a=1:n, b=1:n)
i <- 1:n
system.time(temp <- y[i, ])
# n old new
# 10^5 .128 .052
# 10^6 .237 .591
# 10^7 3.10 2.882
i <- rep(TRUE, n)
system.time(temp <- y[i, ])
# n old new
# 10^5 .157 .053
# 10^6 .787 .449
# 10^7 3.799 2.138
i <- -1
system.time(temp <- y[i, ])
# n old new
# 10^5 .157 .051
# 10^6 .614 .497
# 10^7 4.163 2.482
i <- rep(1:(n/2), 2) # expect no speedup for this case
system.time(temp <- y[i, ])
# n old new
# 10^5 .559 .782
# 10^6 6.066 6.078
# Times shown are the user times reported by system.time.
# The savings are mostly substantial in the cases where I
# expected them.
# I've noticed a lot of variability in results from system.time,
# so I don't consider these very accurate, and I'm not much
# concerned about the cases where the new time appears worse.
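Given the variability noted above, one way to get steadier numbers is to repeat each measurement and report the median. The following is only a sketch; medianUserTime and its reps argument are hypothetical names, not something used to produce the timings above.

```r
# Hypothetical helper: median user time (in seconds) over 'reps' runs
# of a zero-argument function 'f'.
medianUserTime <- function(f, reps = 5) {
    times <- vapply(seq_len(reps),
                    function(k) system.time(f())[["user.self"]],
                    numeric(1))
    median(times)
}

n <- 10^5
y <- data.frame(a = 1:n, b = 1:n)
i <- 1:n
medianUserTime(function() y[i, ])
```

The same call can be repeated with the other subscripts (rep(TRUE, n), -1, rep(1:(n/2), 2)) under both versions of [.data.frame.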
On Thu, Jul 3, 2008 at 1:08 PM, Martin Maechler <[EMAIL PROTECTED]>
wrote:
> >>>>> "TH" == Tim Hesterberg <[EMAIL PROTECTED]>
> >>>>> on Tue, 1 Jul 2008 15:23:53 -0700 writes:
>
> TH> There is a bug in the standard version of [.data.frame;
> TH> it mixes up handling duplicates and NAs when subscripting rows.
>
> TH> x <- data.frame(x=1:3, y=2:4, row.names=c("a","b","NA"))
> TH> y <- x[c(2:3, NA),]
> TH> y
>
> TH> It creates a data frame with duplicate row names, which then won't print.
>
> and that's a bug, indeed
> ("introduced" in R version 2.5.0, when the [.data.frame code was much
> optimized for speed, with quite some care), and I have committed
> a fix (and a regression test) to both R-devel and R-patched.
>
> Thanks a lot for the bug report, Tim!
>
> Now about your newly proposed code:
> I'm sorry to say that it looks so much different from the source
> code in
> https://svn.r-project.org/R/trunk/src/library/base/R/dataframe.R
> that I don't think we would accept it as a substitute, easily.
>
> Could you try to provide a minimal patch against the source code
> and also a self-contained example that exhibits the speed gain
> you are aiming for ?
>
> Best regards,
> Martin Maechler, ETH Zurich
>
> [.........................]
>
>
> TH> On Tue, Jul 1, 2008 at 11:20 AM, Tim Hesterberg <
> [EMAIL PROTECTED]>
> TH> wrote:
>
> >> Below is a version of [.data.frame that is faster
> >> for subscripting rows of large data frames; it avoids calling
> >> duplicated(rows)
> >> if there is no need to check for duplicate row names, when:
> >> i is logical
> >> attr(x, "dup.row.names") is not NULL (S+ compatibility)
> >> i is numeric and negative
> >> i is strictly increasing
>
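For reference, the interaction between NA and duplicate row names in the bug report quoted above can be traced on the row names alone. This is only a sketch of the two steps in the revised patch (NA replacement, then de-duplication), not the code in R itself; rn is an illustrative variable name.

```r
# Row names as in the quoted example; the third name is the string "NA".
rn <- c("a", "b", "NA")
i <- c(2:3, NA)

rows <- rn[i]                    # "b", "NA", NA: one string and one missing value
rows[is.na(rows)] <- "NA"        # the missing name becomes the string "NA"
if (any(duplicated(rows)))       # "NA" now appears twice
    rows <- make.unique(as.character(rows))
rows                             # "b" "NA" "NA.1"
```

Handling the NA replacement first and the duplicate check second keeps the two concerns separate, which is the point of the revised lines in the diff.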