[R] please comment on my function

Sam Steingold Fri, 14 Sep 2012 09:34:18 -0700

This function is supposed to canonicalize a vector of language codes:

--8<---------------cut here---------------start------------->8---
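canonicalize.language <- function (s) {
  s <- tolower(s)
  # 5-character codes such as "en-us" or "en_us": keep the 2-letter language part
  long <- nchar(s) == 5
  s[long] <- sub("^([a-z]{2})[-_][a-z]{2}$","\\1",s[long])
  # anything that is still neither a 2-character code nor "c" becomes "unknown"
  s[nchar(s) != 2 & s != "c"] <- "unknown"
  s
}
canonicalize.language(c("aa","bb-cc","DD-abc","eee","ff_FF","C"))
## [1] "aa"      "bb"      "unknown" "unknown" "ff"      "c"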
--8<---------------cut here---------------end--------------->8---


It does what I want, but it takes 4.5 seconds on a vector of
length 10,256,341. I wonder if I might be doing something awfully stupid.
I thought that sub() was the slow part, but my second attempt:
--8<---------------cut here---------------start------------->8---
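canonicalize.language <- function (s) {
  s <- tolower(s)
  # 5-character codes with "_" or "-" in position 3: keep the first 2 characters
  good <- nchar(s) == 5 & substr(s,3,3) %in% c("_","-")
  s[good] <- substr(s[good],1,2)
  # anything that is still neither a 2-character code nor "c" becomes "unknown"
  s[nchar(s) != 2 & s != "c"] <- "unknown"
  s
}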
--8<---------------cut here---------------end--------------->8---
was even slower (6.4 sec).
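
For anyone who wants to reproduce the comparison, the two versions can be timed side by side roughly as sketched here. The real data is not shown in this post, so the synthetic vector, its length, and the names canon.sub and canon.substr are assumptions made only for illustration. Note also that the two versions agree on these inputs but are not strictly equivalent: the second one does not check that the characters around the separator are letters, so "12-34" would become "12" rather than "unknown".

```r
# Synthetic stand-in for the real data (an assumption: the original
# vector is not available, so both the codes and the length are made up).
set.seed(1)
codes <- c("en", "en-US", "pt_BR", "C", "DD-abc", "eee")
x <- sample(codes, 1e5, replace = TRUE)

# First version from the post (sub-based), under a hypothetical name.
canon.sub <- function (s) {
  s <- tolower(s)
  long <- nchar(s) == 5
  s[long] <- sub("^([a-z]{2})[-_][a-z]{2}$", "\\1", s[long])
  s[nchar(s) != 2 & s != "c"] <- "unknown"
  s
}

# Second version from the post (substr-based), under a hypothetical name.
canon.substr <- function (s) {
  s <- tolower(s)
  good <- nchar(s) == 5 & substr(s, 3, 3) %in% c("_", "-")
  s[good] <- substr(s[good], 1, 2)
  s[nchar(s) != 2 & s != "c"] <- "unknown"
  s
}

# Elapsed wall-clock seconds for each version; the numbers depend on
# the machine and on the R version, so none are quoted here.
t.sub    <- system.time(r.sub    <- canon.sub(x))[["elapsed"]]
t.substr <- system.time(r.substr <- canon.substr(x))[["elapsed"]]
print(c(sub = t.sub, substr = t.substr))

# Both versions should agree on this particular input.
stopifnot(identical(r.sub, r.substr))
```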

My two concerns are:

1. avoid allocating many small objects which are never collected
2. run fast

Which would be the best implementation?
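
One direction that might be worth trying is to canonicalize only the distinct values and map the results back, as sketched here. This assumes the vector holds few distinct codes relative to its length, which seems plausible for language codes but is not stated above. The name canonicalize.language.uniq is hypothetical, and no timing is claimed for it.

```r
# Sketch: do the string work once per distinct value, then expand.
canonicalize.language.uniq <- function (s) {
  u <- unique(s)                  # distinct raw values only
  cu <- tolower(u)
  long <- nchar(cu) == 5
  cu[long] <- sub("^([a-z]{2})[-_][a-z]{2}$", "\\1", cu[long])
  cu[nchar(cu) != 2 & cu != "c"] <- "unknown"
  cu[match(s, u)]                 # look up each element's canonical form
}

res <- canonicalize.language.uniq(c("aa","bb-cc","DD-abc","eee","ff_FF","C"))
print(res)
```

The regular expression and the "unknown" rule are unchanged from the first version; only the number of strings they are applied to differs.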

Thanks a lot for your insight!

-- 
Sam Steingold (http://sds.podval.org/) on Ubuntu 12.04 (precise) X 11.0.11103000
http://www.childpsy.net/ http://think-israel.org http://openvotingconsortium.org
http://memri.org http://camera.org http://truepeace.org
WHO ATE MY BREAKFAST PANTS?

