This function is supposed to canonicalize the language:

--8<---------------cut here---------------start------------->8---
canonicalize.language <- function (s) {
  s <- tolower(s)
  long <- nchar(s) == 5
  s[long] <- sub("^([a-z]{2})[-_][a-z]{2}$","\\1",s[long])
  s[nchar(s) != 2 & s != "c"] <- "unknown"
  s
}
canonicalize.language(c("aa","bb-cc","DD-abc","eee","ff_FF","C"))
[1] "aa"      "bb"      "unknown" "unknown" "ff"      "c"
--8<---------------cut here---------------end--------------->8---
It does what I want it to do, but it takes 4.5 seconds on a vector of
length 10,256,341 - I wonder if I might be doing something awfully stupid.
I thought that sub() was slow, but my second attempt:

--8<---------------cut here---------------start------------->8---
canonicalize.language <- function (s) {
  s <- tolower(s)
  good <- nchar(s) == 5 & substr(s,3,3) %in% c("_","-")
  s[good] <- substr(s[good],1,2)
  s[nchar(s) != 2 & s != "c"] <- "unknown"
  s
}
--8<---------------cut here---------------end--------------->8---

was even slower (6.4 sec).

My two concerns are:

1. avoid allocating many small objects which are never collected
2. run fast

Which would be the best implementation?

Thanks a lot for your insight!

-- 
Sam Steingold (http://sds.podval.org/) on Ubuntu 12.04 (precise) X 11.0.11103000
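
A third variant that might be worth timing is sketched below. It rests on an assumption that is not stated above: that the vector contains far fewer distinct language tags than elements. If so, the string work only needs to run on the unique values, and match() can expand the result back to full length. The name canonicalize.language.uniq is hypothetical, the rules are copied from the first version, and the variant has not been timed on the 10-million-element vector.

```r
# Hypothetical variant: canonicalize each distinct value once, then map back.
# Assumes the number of distinct values is small relative to length(s).
canonicalize.language.uniq <- function (s) {
  u <- unique(s)                    # distinct raw values
  v <- tolower(u)
  long <- nchar(v) == 5
  # "xx-yy" or "xx_yy" -> "xx" (same regex as the first version)
  v[long] <- sub("^([a-z]{2})[-_][a-z]{2}$", "\\1", v[long])
  # anything that is not a 2-letter code or "c" becomes "unknown"
  v[nchar(v) != 2 & v != "c"] <- "unknown"
  v[match(s, u)]                    # expand back to the original length
}

canonicalize.language.uniq(c("aa","bb-cc","DD-abc","eee","ff_FF","C"))
# [1] "aa"      "bb"      "unknown" "unknown" "ff"      "c"
```

Wrapping each version in system.time() on the same input would show whether the extra unique() and match() passes pay for themselves. If nearly every element is distinct, this variant would likely be no faster than the original.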