Data frames have unique row names *by definition* (White Book p.57).
Note that R is extensible, so any package writer has (for 14 years since
the White Book) been entitled to assume that. A minimum test suite is to
run R CMD check on all CRAN packages, and to read all the relevant
documentation. That would reveal a large number of uses of row names and
of their uniqueness.
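A small illustration of the point, as a sketch only: code can use row names as lookup keys precisely because R refuses duplicates. The data frame `d` and its row names are made up for this example.

```r
# Hypothetical data frame with explicit row names (illustration only)
d <- data.frame(a = 1:3, row.names = c("x", "y", "z"))

# Row names act as lookup keys, which only works if they are unique
value_y <- d["y", "a"]

# Assigning duplicated row names is rejected with an error
result <- tryCatch({
  suppressWarnings(row.names(d) <- c("x", "x", "z"))
  "accepted"
}, error = function(e) "rejected")
```

Here `result` ends up as "rejected" and `d` keeps its original row names.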
On Mon, 12 Dec 2005, Matthew Dowle wrote:
I guess the mail list precludes attachments then, makes sense. I have sent
the modified source directly to anyone who has asked.
I had a look at the light-weight data.frame class post
(http://tolstoy.newcastle.edu.au/R/devel/05/05/0837.html) :
Now the transcript itself:
# the motivation: subscription of a data.frame is *much* (almost 20
times) slower than that of a list
# compare
n = 1e6
i = seq(n)
x = data.frame(a=seq(n), b=seq(n))
system.time(x[i,], gcFirst=TRUE)
[1] 1.01 0.14 1.14 0.00 0.00
x = list(a=seq(n), b=seq(n))
system.time(lapply(x, function(col) col[i]), gcFirst=TRUE)
[1] 0.06 0.00 0.06 0.00 0.00
# the solution: define methods for the light-weight data.frame class
lwdf = function(...) structure(list(...), class = "lwdf")
...
But if I have understood correctly, I think the time difference here is just
down to the rownames. The rownames are 1:n stored in character form. They
take the most time and space in this example, but are never used. I'm not
sure why 1:n in character form would ever be useful, in fact. Running the
example above with my modifications appears to fix the problem, i.e. the time
difference becomes negligible. I needed to make a one-line change to
[.data.frame, and I've sent that to anyone who requested the code.
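For readers without the original post to hand, here is a minimal sketch of the list-of-columns idea, assuming row subscripting is all that is needed. The `lwdf` constructor is the one quoted above; `lwdf_rows` is a hypothetical helper invented for this illustration, not a method from the quoted post or from the modified source.

```r
# Constructor as quoted in the transcript above
lwdf <- function(...) structure(list(...), class = "lwdf")

# Hypothetical helper (not from the original post): subscript every column
# by the same row index. No row names are created or copied.
lwdf_rows <- function(x, i) {
  structure(lapply(unclass(x), function(col) col[i]), class = "lwdf")
}

x <- lwdf(a = 1:5, b = letters[1:5])
y <- lwdf_rows(x, c(2L, 4L))
```

Columns may have different types, as in a data frame, but the only cost of subscripting is one vector subscript per column.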
I can see the problem :
apropos("data.frame")
[1] "[.data.frame" "as.matrix.data.frame" "data.frame" "dim.data.frame"
[5] "format.data.frame" "print.data.frame" ".__C__data.frame" "aggregate.data.frame"
[9] "$<-.data.frame" "Math.data.frame" "Ops.data.frame" "Summary.data.frame"
[13] "[.data.frame" "[<-.data.frame" "[[.data.frame" "[[<-.data.frame"
[17] "as.data.frame" "as.data.frame.AsIs" "as.data.frame.Date" "as.data.frame.POSIXct"
[21] "as.data.frame.POSIXlt" "as.data.frame.array" "as.data.frame.character" "as.data.frame.complex"
[25] "as.data.frame.data.frame" "as.data.frame.default" "as.data.frame.factor" "as.data.frame.integer"
[29] "as.data.frame.list" "as.data.frame.logical" "as.data.frame.matrix" "as.data.frame.model.matrix"
[33] "as.data.frame.numeric" "as.data.frame.ordered" "as.data.frame.package_version" "as.data.frame.raw"
[37] "as.data.frame.table" "as.data.frame.ts" "as.data.frame.vector" "as.list.data.frame"
[41] "as.matrix.data.frame" "by.data.frame" "cbind.data.frame" "data.frame"
[45] "dim.data.frame" "dimnames.data.frame" "dimnames<-.data.frame" "duplicated.data.frame"
[49] "format.data.frame" "is.data.frame" "is.na.data.frame" "mean.data.frame"
[53] "merge.data.frame" "print.data.frame" "rbind.data.frame" "row.names.data.frame"
[57] "row.names<-.data.frame" "rowsum.data.frame" "split.data.frame" "split<-.data.frame"
[61] "stack.data.frame" "subset.data.frame" "summary.data.frame" "t.data.frame"
[65] "transform.data.frame" "unique.data.frame" "unstack.data.frame" "xpdrows.data.frame"
But I think the changes would be quick to make. Is anything else affected?
Do any test suites exist to confirm R hasn't broken?
On the face of it, allowing data frames to have null row names seems a small
change, and would make them consistent with matrices, with large time and
space benefits. However, I can see the argument for a new class instead, for
safety. What's the consensus?
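To make the comparison with matrices concrete, a short sketch (object names are arbitrary): a matrix created without dimnames reports NULL row names, whereas a data frame always reports a character vector of row names.

```r
# A matrix created without dimnames has no row names at all
m <- matrix(0L, nrow = 3, ncol = 2)
m_names <- rownames(m)

# A data frame of the same shape always reports row names
d <- data.frame(a = integer(3), b = integer(3))
d_names <- rownames(d)
```

Here `m_names` is NULL, while `d_names` is the character vector "1", "2", "3".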
-----Original Message-----
From: Hin-Tak Leung [mailto:[EMAIL PROTECTED]
Sent: 09 December 2005 18:41
To: Gabor Grothendieck
Cc: Matthew Dowle; r-devel@r-project.org; Peter Dalgaard
Subject: Re: [Rd] [R] data.frame() size
Gabor Grothendieck wrote:
There was nothing attached in the copy that came through
to me.
I'd like to see that patch also.
By the way, there was some discussion earlier this year
on a light-weight data.frame class but I don't think anyone ever
posted any code.
It may have been me. I am working on a bit-packed data.frame which only uses
2-bits per unit of data, so it is 4 units per RAWSXP. (work in progress,
nothing to show).
So I am very interested to see the patch.
Yes, I took a couple of weeks reading and learning where all the memory has
gone in data.frame. The rowname/column name allocation is a bit stupid.
Each rowname and each column name is a full R object, so there is a 32 (or
28) byte overhead just from managing that, before the STRSXP for the actual
string, which is another X bytes. So for a 1 x N data.frame with integers
for content, the content is 4 bytes * N, but the rowname/columnname is 32
* N -ish (a 9x increase). A word is 32 bits on most people's machines, and I
am counting the extra one because you have to keep the address of each
SEXPREC somewhere, so it is 7+1 = 8 words, if I understand it correctly.
Here is the relevant comment, quoted verbatim from around line 225 of
"src/include/Rinternals.h":
/* The generational collector uses a reduced version of SEXPREC as a
header in vector nodes. The layout MUST be kept consistent with
the SEXPREC definition. The standard SEXPREC takes up 7 words on
most hardware; this reduced version should take up only 6 words.
In addition to slightly reducing memory use, this can lead to more
favorable data alignment on 32-bit architectures like the Intel
Pentium III where odd word alignment of doubles is allowed but much
less efficient than even word alignment. */
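The arithmetic in the message above can be spelled out as a quick check. It uses the same assumptions as the message (32-bit words, 7 header words plus 1 pointer word per name, 4-byte integers) and ignores the string payload of "X bytes". The variable names are illustrative only.

```r
word_bytes     <- 4L        # 32-bit word, as assumed in the message
words_per_name <- 7L + 1L   # 7-word SEXPREC plus 1 word holding its address
name_overhead  <- word_bytes * words_per_name  # bytes per row name, string data excluded
content_bytes  <- 4L        # one integer cell per row

# Total bytes per row relative to the integer content alone
ratio <- (name_overhead + content_bytes) / content_bytes
```

This gives 32 bytes of overhead per name and a ratio of 9, matching the "9x increase" quoted above.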
Hin-Tak Leung
On 12/9/05, Matthew Dowle <[EMAIL PROTECTED]> wrote:
Hi,
Please see below for post on r-help regarding data.frame() and the
possibility of dropping rownames, for space and time reasons. I've
made some changes, attached, and it seems to be working well. I see
the expected space (90% saved) and time (10 times faster) savings.
There are no doubt some bugs, and needs more work and testing, but I
thought I would post first at this stage.
Could some changes along these lines be made to R ? I'm happy to help
with testing and further work if required. In the meantime I can work
with overloaded functions which fixes the problems in my case.
Functions affected:
dim.data.frame
format.data.frame
print.data.frame
data.frame
[.data.frame
as.matrix.data.frame
Modified source code attached.
Regards,
Matthew
-----Original Message-----
From: Matthew Dowle
Sent: 09 December 2005 09:44
To: 'Peter Dalgaard'
Cc: 'r-help@stat.math.ethz.ch'
Subject: RE: [R] data.frame() size
That explains it. Thanks. I don't need rownames though, as I'll only
ever use integer subscripts. Is there any way to drop them, or even
better not create them in the first place? The memory saved (90%) by
not having them and 10 times speed up would be very useful. I think I
need a data.frame rather than a matrix because I have columns of
different types in real life.
rownames(d) = NULL
Error in "dimnames<-.data.frame"(`*tmp*`, value = list(NULL, c("a", "b" :
invalid 'dimnames' given for data frame
-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of
Peter Dalgaard
Sent: 08 December 2005 18:57
To: Matthew Dowle
Cc: 'r-help@stat.math.ethz.ch'
Subject: Re: [R] data.frame() size
Matthew Dowle <[EMAIL PROTECTED]> writes:
Hi,
In the example below why is d 10 times bigger than m, according to
object.size ? It also takes around 10 times as long to create, which
fits with object.size() being truthful. gcinfo(TRUE) also indicates
a great deal more garbage collector activity caused by data.frame()
than matrix().
$ R --vanilla
....
nr = 1000000
system.time(m<<-matrix(integer(1), nrow=nr, ncol=2))
[1] 0.22 0.01 0.23 0.00 0.00
system.time(d<<-data.frame(a=integer(nr), b=integer(nr)))
[1] 2.81 0.20 3.01 0.00 0.00 # 10 times longer
dim(m)
[1] 1000000 2
dim(d)
[1] 1000000 2 # same dimensions
storage.mode(m)
[1] "integer"
sapply(d, storage.mode)
a b
"integer" "integer" # same storage.mode
object.size(m)/1024^2
[1] 7.629616
object.size(d)/1024^2
[1] 76.29482 # but 10 times bigger
sum(sapply(d, object.size))/1024^2
[1] 7.629501 # or is it? If it's not really 10 times bigger, why 10 times longer above?
Row names!!
r <- as.character(1:1e6)
object.size(r)
[1] 72000056
object.size(r)/1024^2
[1] 68.6646
'nuff said?
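The same comparison can be repeated at a smaller scale as a sketch. Exact byte counts depend on the R version and platform, so only the direction of the difference is relied on here. The variable names are arbitrary.

```r
n <- 1e5
ints  <- seq_len(n)          # integer row indices
chars <- as.character(ints)  # the same indices stored as character row names

size_int <- as.numeric(object.size(ints))
size_chr <- as.numeric(object.size(chars))

# How many times larger the character form is than the integer form
size_ratio <- size_chr / size_int
```

The character form is the larger of the two, which is the overhead that the row names add to the data frame in the example above.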
--
   O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark      Ph:  (+45) 35327918
~~~~~~~~~~ - ([EMAIL PROTECTED])                     FAX: (+45) 35327907
______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
--
Brian D. Ripley, [EMAIL PROTECTED]
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK Fax: +44 1865 272595