Stavros Macrakis wrote:
On Sat, Jan 3, 2009 at 7:02 PM,  <l...@stat.uiowa.edu> wrote:
R's interpreter is fairly slow due in large part to the allocation of
argument lists and the cost of lookups of variables, including ones
like [<- that are assembled and looked up as strings on every call.

Wow, I had no idea the interpreter was so awful. Just some simple
tree-to-tree transformations would speed things up, I'd think, e.g.
`<-`(`[`(...), ...) ==> `<-[`(...,...).

Doesn't really help (and it's not quite correct: a[2] <- 1 is equivalent to

a <- `[<-`(a, 2,  1)

with some sneakiness that assumes the two a's are the same, so that the second instance may be destructively modified.)
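To see the same mechanism with a user-defined replacement function (the name `second<-` and the toy vector below are made up purely for illustration):

`second<-` <- function(x, value) { x[2] <- value; x }
a <- c(10, 20, 30)
second(a) <- 99   # evaluated roughly as  a <- `second<-`(a, value = 99)
a                 # [1] 10 99 30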

The actual interpreter is not much of a bottleneck. There are two other major obstacles:

1) Things may not be what they seem

2) Insufficient control over object duplication


1) is the major impediment to compilability (look for talks/papers by Luke for further details and ideas about what to do about it). The basic issue is that at no point can you be sure that the "log" function calculates logarithms. It might be redefined as a side effect of the previous expression. This is a feature of the language as such, and it is difficult to change without destroying features that people actually use. The upshot is that every time we see an object name, we must search along the current search path to find its current binding.
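A toy example of the lookup problem -- nothing here is special to log, any binding can be shadowed at any time:

f <- function() {
  log <- function(x) "not a logarithm"   # shadows base::log inside f
  log(10)
}
f()        # "not a logarithm"
log(10)    # 2.302585, base::log is found again via the search path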

2) is a little contentious: it is not certain how much we would gain by attacking it, only that it would be a heck of a lot of work. The issue is that we do not use reference counting as, e.g., Java or Tcl do. We use a primitive counter called NAMED which can be 0, 1, or 2, and which only counts upwards. When it reaches 2, destructive modification is disallowed and the object must be copied. I.e., consider

x <- rnorm(1e6)
y <- x

at this point we actually have x and y referring to the same ~8MB block of memory. However, the semantics of R are that this is a virtual copy, so either y[1] <- 1 or x[1] <- 1 entails duplicating the object. Fair enough: if an object is bound to multiple names, we cannot modify it in place; the problem is that we lose track when the references go away, and thus

y <- x
y[1] <- 1
x[1] <- 1

causes TWO duplications. The really nasty bit is that we very often get objects temporarily bound to two names (think about what happens with arguments in function calls).
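In builds of R where tracemem() is available, you can watch the copies being made (exactly which copies are reported may vary with the R version and its copying heuristics):

x <- rnorm(1e6)
tracemem(x)   # start reporting duplications of this object
y <- x        # no copy yet: x and y share the same block
y[1] <- 1     # copy #1, made so that y can be modified
x[1] <- 1     # copy #2, because x's NAMED count never went back down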

Unfortunately, we cannot base the memory management purely on reference counting. And of course, doing so, even partially, implies that we need a much more concrete approach to the unbinding of objects. Notice, for instance, that the names used in a function evaluation frame are not guaranteed to be unbindable when the function exits. Something might have saved the evaluation environment, e.g. using e <<- environment(), but there are also more subtle methods.
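A sketch of the simplest case (the names f, big, and e are arbitrary):

f <- function() {
  big <- rnorm(1e6)
  e <<- environment()   # the evaluation frame escapes into the global env
  invisible(NULL)
}
f()
exists("big", envir = e)        # TRUE: the binding survives the call
length(get("big", envir = e))   # 1000000, so the memory cannot be reclaimed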


--
   O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark      Ph:  (+45) 35327918
~~~~~~~~~~ - (p.dalga...@biostat.ku.dk)              FAX: (+45) 35327907
