Folks: I do not wish to agree or disagree with the criticisms of either the speed or possible design flaws of "[". But let's at least see what the docs say about the issues, using the simple example you provided:
m = matrix(1:9, 3, 3)
md = data.frame(m)

md[1]     # the first column
## as documented. This is because a data frame is a list of 3 identical
## length columns, and this is how [ works for lists

m[1]      # the first element (i.e., m[1,1])
## as documented. A matrix is just a vector with a dim attribute and
## this is how [ works for vectors

md[,i=3]  # third row
## See below

m[,i=3]   # third column
## Correct, as documented in ?"[" for matrices, to wit:

"Note that these operations do not match their index arguments in the
standard way: argument names are ignored and positional matching only is
used. So m[j=2,i=1] is equivalent to m[2,1] and not to m[1,2]. "

## Note that the next lines immediately following say:

"This may not be true for methods defined for them; for example it is not
true for the data.frame methods described in [.data.frame.

To avoid confusion, do not name index arguments (but drop and exact must
be named). "

So, while it may be fair to characterize the md[,i=3] as a design flaw,
it is both explicitly pointed out and warned against. Note that, of course

md[,3]    ## 3rd column, good practice
md[,j=3]  ## also 3rd column .. but warned against as bad practice

Whether a behavior should be considered a "bug" if it is explicitly
warned against in the docs, I leave for others to decide. Too deep for me.

Cheers,
Bert

-----Original Message-----
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Wacek Kusnierczyk
Sent: Friday, March 27, 2009 2:28 PM
To: Romain Francois fran; r-de...@r-project.org
Cc: R help
Subject: Re: [R] "[.data.frame" and lapply

redirected to r-devel, because there are implementational details of
[.data.frame discussed here. spoiler: at the bottom there is a fairly
interesting performance result.

Romain Francois wrote:
>
> Hi,
>
> This is a bug I think. [.data.frame treats its arguments differently
> depending on the number of arguments.

you might want to hesitate a bit before you say that something in r is a
bug, if only because it drives certain people mad. r is a carefully
tested software, and [.data.frame is such a basic function that if what
you talk about were a bug, it wouldn't have persisted until now.

treating the arguments differently depending on their number is actually
(if clearly...) documented: if there is one index (the 'i'), it selects
columns. if there are two, 'i' selects rows. however, not all seems
fine, there might be a design flaw:

   # dummy data frame
   d = structure(names=paste('col', 1:3, sep='.'),
      data.frame(row.names=paste('row', 1:3, sep='.'),
         matrix(1:9, 3, 3)))

   d[1:2]
   # correctly selects two first columns
   # 1:2 passed to [.data.frame as i, no j given

   d[,1:2]
   # correctly selects two first columns
   # 1:2 passed to [.data.frame as j, i given the missing argument value
   # (note the comma)

   d[,i=1:2]
   # correctly selects two first rows
   # 1:2 passed to [.data.frame as i, j given the missing argument value
   # (note the comma)

   d[j=1:2,]
   # correctly selects two first columns
   # 1:2 passed to [.data.frame as j, i given the missing argument value
   # (note the comma)

   d[i=1:2]
   # correctly (arguably) selects the first two columns
   # 1:2 passed to [.data.frame as i, no j given

   d[j=1:2]
   # wrong: returns the whole data frame
   # does not recognize the index as i because it is explicitly named 'j'
   # does not recognize the index as j because there is only one index

i say this *might* be a design flaw because it's hard to judge what the
design really is. the r language definition (!) [1, sec. 3.4.3 p. 18]
says:

" The most important example of a class method for [ is that used for
data frames. It is not be described in detail here (see the help page
for [.data.frame, but in broad terms, if two indices are supplied (even
if one is empty) it creates matrix-like indexing for a structure that is
basically a list of vectors of the same length.
If a single index is supplied, it is interpreted as indexing the list of
columns -- in that case the drop argument is ignored, with a warning."

it does not say what happens when only one *named* index argument is
given. from the above, it would indeed seem that there is a *bug* here:
in the last example above only one index is given, and yet columns are
not selected, even though the *language definition* says they should.
(so it's not a documented feature, it's a contra-definitional misfeature
-- a bug?)

somewhat on the side, the 'matrix-like indexing' above is fairly
misleading; just try the same patterns of indexing -- one index, two
indices, named indices -- on a data frame and a matrix of the same shape:

   m = matrix(1:9, 3, 3)
   md = data.frame(m)

   md[1]
   # the first column
   m[1]
   # the first element (i.e., m[1,1])

   md[,i=3]
   # third row
   m[,i=3]
   # third column

the quote above refers to the ?'[.data.frame' for details.
unfortunately, the help page is a lump of explanations for various
'['-like operators, and it is *not* a definition of any sort. it does
not provide much more detail on '[.data.frame' -- it hardly serves as a
design specification. in particular, it does not explain the issue of
named arguments to '[.data.frame' at all.

> `[.data.frame` only is called with two arguments in the second case, so
> the following condition is true:
>
> if(Narg < 3L) {  # list-like indexing or matrix indexing
>
> And then, the function assumes the argument it has been passed is i, and
> eventually calls NextMethod("[") which I think calls
> `[.listof`(x,i,...), since i is missing in `[.data.frame` it is not
> passed to `[.listof`, so you have something equivalent to as.list(d)[].
>
> I think we can replace the condition with this one:
>
> if(Narg < 3L && !has.j) {  # list-like indexing or matrix indexing
>
> or this:
>
> if(Narg < 3L) {  # list-like indexing or matrix indexing
>    if(has.j) i <- j
>

indeed, for a moment i thought a trivial fix somewhere there would
suffice.
unfortunately, the code for [.data.frame [2, lines 500-641] is so clean
and readable that i had to give up reading it, forget fixing. instead,
i wrote a new version of '[.data.frame' from scratch. it fixes (or at
least seems to fix, as far as my quick assessment goes) the problem.
the function subdf (see the attached dataframe.r) is the new version of
'[.data.frame':

   # dummy data frame
   d = structure(names=paste('col', 1:3, sep='.'),
      data.frame(row.names=paste('row', 1:3, sep='.'),
         matrix(1:9, 3, 3)))

   d[j=1:2]
   # incorrect: the whole data frame

   subdf(d, j=1:2)
   # correct, only the first two columns

otherwise, subdf returns results equivalent (sensu all.equal; see below)
to those returned by [.data.frame on the same input, modulo some more or
less minor details. for example, i think the dropped-drop warnings go
wrong in the original:

   d[1, drop=FALSE]
   # warning: drop argument will be ignored

which suggests that dimensions will be dropped, while the intention is
that the actual argument will be ignored and the value will be FALSE
instead (while the default is TRUE, since i is specified). well, it's
just one more confusing bit in r. the rewritten version warns about
dropped drop only if it is explicitly TRUE:

   subdf(d, 1, drop=FALSE)
   # no warning
   subdf(d, 1, drop=TRUE)
   # warning

another issue that differs in my version is that i don't see much sense
in being able to select rows by indexing with NA:

   d[NA,1]
   # one row filled with NAs
   d[NA,]
   # data frame of the shape of d, filled with NAs

which is incoherent with how NAs are treated in column indices (i.e.,
they raise an error). the rewritten version raises an error if any
element of any index is an NA.

these minor differences are easily modifiable should compliance with the
original 'design' be desirable.

interestingly, there is a reduction in code by some 40 lines (~30%) wrt.
the original, even though the new code is quite redundant (but so was
the original).
with a little effort, it can be compressed further, but i felt it would
become more convoluted and less readable, and also less efficient.
procedural abstraction could help, but would also negatively impact
performance. (presumably, an implementation in c would run faster.)

incidentally (here's the best part!), my version seems to perform much
better than the original, at least in a limited set of naive benchmarks.
here are some results, which you can (hopefully) reproduce using the
code in the attached test.r. the data is a dummy df with 1k rows and 1k
columns, filled with rnorm; each indexing was repeated 1000 times for
both the original and the modified version:

      original patched   ratio  test
    1    0.002   0.001    2.00  d[]
    2    0.027   0.001   27.00  d[drop = FALSE]
    3    0.025   0.002   12.50  d[drop = TRUE]
    4    0.026   0.002   13.00  d[, drop = FALSE]
    5    0.026   0.003    8.67  d[, drop = TRUE]
    6    1.274   0.002  637.00  d[, ]
    7    1.255   0.001 1255.00  d[, , ]
    8    1.183   0.001 1183.00  d[, , drop = FALSE]
    9    1.183   0.003  394.33  d[, , drop = TRUE]
   10    0.013   0.011    1.18  d[r]
   11    0.040   0.034    1.18  d[r, drop = TRUE]
   12    0.037   0.010    3.70  d[r, drop = FALSE]
   13    0.012   0.011    1.09  d[i = r]
   14    0.036   0.034    1.06  d[i = r, drop = TRUE]
   15    0.037   0.011    3.36  d[i = r, drop = FALSE]
   16    0.222   0.163    1.36  d[rr]
   17    0.247   0.112    2.21  d[rr, drop = FALSE]
   18    0.204   0.144    1.42  d[rr, drop = TRUE]
   19    0.174   0.120    1.45  d[i = rr]
   20    0.201   0.125    1.61  d[i = rr, drop = FALSE]
   21    0.215   0.147    1.46  d[i = rr, drop = TRUE]
   22    2.266   1.159    1.96  d[rr, ]
   23    2.236   1.164    1.92  d[rr, , drop = FALSE]
   24    2.275   1.171    1.94  d[rr, , drop = TRUE]
   25    2.269   1.165    1.95  d[i = rr, ]
   26    2.264   1.155    1.96  d[i = rr, , drop = FALSE]
   27    2.290   1.189    1.93  d[i = rr, , drop = TRUE]
   28    2.301   1.198    1.92  d[, i = rr]
   29    2.239   1.158    1.93  d[, i = rr, drop = FALSE]
   30    2.310   1.161    1.99  d[, i = rr, drop = TRUE]
   31    0.002   0.003    0.67  d[j = c]
   32    0.026   0.011    2.36  d[j = c, drop = FALSE]
   33    0.026   0.003    8.67  d[j = c, drop = TRUE]
   34    0.001   0.111    0.01  d[j = cc]
   35    0.025   0.110    0.23  d[j = cc, drop = FALSE]
   36    0.025   0.111    0.23  d[j = cc, drop = TRUE]
   37    0.243   0.051    4.76  d[rr, cc]
   38    0.243   0.051    4.76  d[rr, cc, drop = FALSE]
   39    0.244   0.050    4.88  d[rr, cc, drop = TRUE]
   40    0.244   0.051    4.78  d[i = rr, cc]
   41    0.243   0.050    4.86  d[i = rr, cc, drop = FALSE]
   42    0.244   0.051    4.78  d[i = rr, cc, drop = TRUE]
   43    0.243   0.052    4.67  d[cc, i = rr]
   44    0.244   0.050    4.88  d[cc, i = rr, drop = FALSE]
   45    0.247   0.052    4.75  d[cc, i = rr, drop = TRUE]
   46    0.244   0.050    4.88  d[i = rr, j = cc]
   47    0.244   0.051    4.78  d[i = rr, j = cc, drop = FALSE]
   48    0.244   0.051    4.78  d[i = rr, j = cc, drop = TRUE]
   49    0.244   0.051    4.78  d[j = cc, i = rr]
   50    0.243   0.051    4.76  d[j = cc, i = rr, drop = FALSE]
   51    0.245   0.051    4.80  d[j = cc, i = rr, drop = TRUE]
   52    0.002   0.155    0.01  d[j = cn]
   53    0.429   0.139    3.09  d[i = rn, j = cn]
   54    1.791   0.690    2.60  d[i = c(TRUE, FALSE), j = c(FALSE, TRUE)]

(note: the benchmark relies on a feature of rbenchmark that i have just
added, so you may need to download/update the package before trying.)

in some tests, the difference is two orders of magnitude; in some it's
a factor of 2-5; in some there's no significant difference. in only a
few cases, the original is way faster (e.g., tests 34 and 52), but this
is because the original is wrong there (it simply ignores the index, so
no wonder).

all the expressions above used in benchmarking were also used to test
the equivalence of output from the original and the new version (see
test.r again), and all the comparisons were negative (no difference) --
except for the cases where the original was wrong.

i'd consider making a patch for src/library/base/R/dataframe.R, but
there's a catch here: it seems that some code relies on some part of
the 'design' that differs between the rewrite and the original, and the
build does not complete with the new code (dataframe.R itself builds,
but then other sources fail). anyway, sourcing the attached dataframe.R
suffices for testing.
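
for anyone without rbenchmark at hand, the kind of timing and
equivalence check described above can be approximated in base r. what
follows is only a minimal sketch and is *not* the attached test.r: the
smaller data frame, the index vectors rr and cc, the helper time_expr
and the repetition count are all assumptions made for illustration, and
since subdf is not reproduced here, the equivalence check merely
compares two routes through the stock [.data.frame.

```r
# Dummy data: the post used 1000 x 1000; a smaller frame (an assumption)
# keeps this sketch quick to run.
set.seed(42)
d  <- as.data.frame(matrix(rnorm(200 * 200), nrow = 200, ncol = 200))
rr <- sample(nrow(d), 20)  # assumed row index vector
cc <- sample(ncol(d), 20)  # assumed column index vector

# Indexing expressions to time, kept unevaluated so each can be re-run.
tests <- list(
  quote(d[cc]),
  quote(d[rr, ]),
  quote(d[rr, cc])
)

# Hypothetical helper: elapsed seconds for `reps` evaluations of `expr`.
time_expr <- function(expr, reps = 100) {
  system.time(for (k in seq_len(reps)) eval(expr))[["elapsed"]]
}

timings <- data.frame(
  elapsed = vapply(tests, time_expr, numeric(1)),
  test    = vapply(tests, function(e) paste(deparse(e), collapse = ""),
                   character(1))
)
print(timings)

# Equivalence in the all.equal sense: select columns list-style and then
# rows, and compare with selecting both in a single call.
same <- isTRUE(all.equal(d[rr, cc], d[cc][rr, ]))
print(same)
```

to compare two implementations rather than two expressions, the same
loop could be run once per implementation and the elapsed times divided
to get a ratio column like the one in the table.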

i will be happy to learn where my implementation, benchmarking, and/or
result checking are naive or wrong in any way, as they surely are.

vQ

[1] http://cran.r-project.org/doc/manuals/R-lang.pdf
[2] http://svn.r-project.org/R/trunk/src/library/base/R/dataframe.R

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.