I've been working on an academic R performance project for the last couple of years, which has involved writing an interpreter for R from scratch and a JIT for R vector operations.
With the recent comments on Julia, I thought I'd share some thoughts from my experience, since they differ substantially from the common speculation on R performance.

I went into the project thinking that R would be slow for the commonly cited reasons: NAs, call-by-value, immutable values, the ability to dynamically add/remove variables from environments, etc. But this is largely *not* true. It does require being somewhat clever, but most of the cost of these features can be either eliminated or moved to uncommon cases that won't affect most code. And there's plenty of room for innovation here. The history of JavaScript runtimes over the last decade has shown that dramatic performance improvements are possible even for difficult languages.

This is good news. I think we can keep essentially everything that people like about R and still achieve great performance.

So why is R performance poor now? I think the fundamental reason is related to software engineering: R is nearly impossible to experiment with, so no one tries out new performance techniques on it. There are two main issues here:

1) The R Language Definition doesn't get enough love. I could point out plenty of specific problems, omissions, etc., but I think the high-level problem is that the Language Definition currently conflates three things: (a) the actual language definition, (b) the definition of what is more properly the standard library, and (c) the implementation. This conflation hides how simple the R/S language actually is and, by assuming that the current implementation is the only implementation, obscures performance improvements that could be made by changing the implementation.

2) The R core implementation (e.g. everything in src/main) is too big. There are ~900 functions listed in names.c. This has got to be simply unmanageable. If one were to change the SEXP representation, how many internal functions would have to be checked and updated? This is a severe hindrance to improving performance.

I see little value in debating changes to the language semantics until we've addressed this low-hanging fruit and at least tried to make the current R/S semantics run fast.

Justin

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel