Stavros Macrakis wrote:
On Sat, Jan 3, 2009 at 7:02 PM,  <l...@stat.uiowa.edu> wrote:
R's interpreter is fairly slow due in large part to the allocation of
argument lists and the cost of lookups of variables, including ones
like [<- that are assembled and looked up as strings on every call.

Wow, I had no idea the interpreter was so awful. Just some simple
tree-to-tree transformations would speed things up, I'd think, e.g.
`<-`(`[`(...), ...) ==> `<-[`(...,...).

Doesn't really help (and it's not quite correct: a[2] <- 1 is equivalent to

a <- `[<-`(a, 2,  1)

with some sneakiness that assumes the two a's are the same, so that the second instance may be destructively modified.)
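To see the same mechanism with a user-defined replacement function (the name `second<-` and the toy vector below are made up purely for illustration):

`second<-` <- function(x, value) { x[2] <- value; x }
a <- c(10, 20, 30)
second(a) <- 99   # evaluated roughly as  a <- `second<-`(a, value = 99)
a                 # [1] 10 99 30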

The actual interpreter is not much of a bottleneck. There are two other major obstacles:

1) Things may not be what they seem

2) Insufficient control over object duplication


1) is the major impediment to compilability (look for talks/papers by Luke for further details and ideas about what to do about it). The basic issue is that at no point can you be sure that the "log" function calculates logarithms. It might be redefined as a side effect of the previous expression. This is a feature of the language as such, and it is difficult to change without destroying features that people actually use. The upshot is that every time we see an object name, we must search along the current search path to find its current binding.
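A toy example of the lookup problem -- nothing here is special to log, any binding can be shadowed at any time:

f <- function() {
  log <- function(x) "not a logarithm"   # shadows base::log inside f
  log(10)
}
f()        # "not a logarithm"
log(10)    # 2.302585, base::log is found again via the search path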

2) is a little contentious: it is not certain how much we would gain by attacking it, only that it would be a heck of a lot of work. The issue is that we do not use reference counting as, e.g., Java or Tcl do. We use a primitive counter called NAMED which can be 0, 1, or 2, and which only counts upwards. When it reaches 2, destructive modification is disallowed and the object must be copied. I.e., consider

x <- rnorm(1e6)
y <- x

at this point we actually have x and y referring to the same ~8MB block of memory. However, the semantics of R are that this is a virtual copy, so either y[1] <- 1 or x[1] <- 1 entails duplicating the object. Fair enough: if an object is bound to multiple names, we cannot modify it in place; the problem is that we lose track when the references go away, and thus

y <- x
y[1] <- 1
x[1] <- 1

causes TWO duplications. The really nasty bit is that we very often get objects temporarily bound to two names (think about what happens with arguments in function calls).
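In builds of R where tracemem() is available, you can watch the copies being made (exactly which copies are reported may vary with the R version and its copying heuristics):

x <- rnorm(1e6)
tracemem(x)   # start reporting duplications of this object
y <- x        # no copy yet: x and y share the same block
y[1] <- 1     # copy #1, made so that y can be modified
x[1] <- 1     # copy #2, because x's NAMED count never went back down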

Unfortunately, we cannot base the memory management purely on reference counting. And of course, doing so, even partially, implies that we need a much more concrete approach to the unbinding of objects. Notice, for instance, that the names used in a function evaluation frame are not guaranteed to be unbindable when the function exits. Something might have saved the evaluation environment, e.g. using e <<- environment(), but there are also more subtle methods.
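A sketch of the simplest case (the names f, big, and e are arbitrary):

f <- function() {
  big <- rnorm(1e6)
  e <<- environment()   # the evaluation frame escapes into the global env
  invisible(NULL)
}
f()
exists("big", envir = e)        # TRUE: the binding survives the call
length(get("big", envir = e))   # 1000000, so the memory cannot be reclaimed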


--
   O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark      Ph:  (+45) 35327918
~~~~~~~~~~ - (p.dalga...@biostat.ku.dk)              FAX: (+45) 35327907
