Gabor Grothendieck wrote:
> I suspect strapply is only relatively slow on short strings where
> it doesn't matter anyways since for long strings performance would
> likely be dominated by the underlying regexp operations. I know that
> users are using the package for very long strings since I once had
> to lift the 25,000 character limit since I had complaints about that.
> The expressiveness and brevity of strapply (it would be shortest if it
> were not for the length of the word simplify) offset any disadvantage
> in my view.
>

ok, the attached script tests against strings of length 30000 where the
character that matches is precisely the last one. (gabor3 is a dummy,
because i had no patience to wait over a minute...) note that the
strapply version is still approximately an order of magnitude slower.
with the original script and the string length (m) set to 10000, the
strapply version is two orders of magnitude slower. it might be that the
test is poor, though -- design a smart test where strapply wins ;)
(related to the original problem, of course.)

vQ

generate = function(n, m)
   replicate(n,
      paste(
         paste(sample(letters[c(1:15, 18:26)], m, replace=TRUE), collapse=""),
         sample(letters[16:17], 1),
         sep=""))

tests = list(
   wacek = function(data) {
      p = grep("^[^pq]*p", data)
      list(p=data[p], q=data[-p]) },
   gabor1 = function(data)
      sapply(c(p="^[^pq]*p", q="^[^pq]*q"), grep, x=data, value=TRUE),
   gabor2 = function(data)
      tapply(data, sub("^[^pq]*p(.).*", "\\1", data), c),
   gabor3 = function(data)
      0, # tapply(data, substr(gsub("[^pq]", "", data), 1, 1), c),
   gabor4 = {
      library(gsubfn)
      function(data)
         tapply(data, strapply(data, "^[^pq]*(.)", simplify=c), c) } )

data = generate(10, 30000)
for (name in names(tests)) {
   cat(name, ":\n", sep="")
   print(system.time(replicate(30, tests[[name]](data)))) }
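
before trusting the timings, it may be worth checking on small input that a candidate really splits the data as intended. the following is only a sketch under stated assumptions: generate_fixed and expected are hypothetical names that are not part of the benchmark, the final letters are fixed by hand instead of sampled so that both groups are non-empty, and only base R is used (no gsubfn).

```r
# Hypothetical small-scale check; not part of the benchmark script.
set.seed(42)  # arbitrary seed, only so that the strings are repeatable

# Like generate(), but the final letter of each string is given explicitly
# (an assumption made here so that both the p and the q group are non-empty).
generate_fixed = function(endings, m)
   sapply(endings, function(e)
      paste(paste(sample(letters[c(1:15, 18:26)], m, replace=TRUE), collapse=""),
            e, sep=""),
      USE.NAMES=FALSE)

data = generate_fixed(c("p", "q", "q", "p", "q", "p", "q", "q"), 50)

# Reference answer: the only p or q in each string is its last character.
last = substr(data, nchar(data), nchar(data))
expected = list(p=data[last == "p"], q=data[last == "q"])

# The grep-based split, copied from the tests list.
wacek = function(data) {
   p = grep("^[^pq]*p", data)
   list(p=data[p], q=data[-p]) }

result = wacek(data)
print(identical(result$p, expected$p))
print(identical(result$q, expected$q))
```

the same comparison against expected could be repeated for each of the other entries in tests, keeping in mind that sapply and tapply may return a matrix or an array rather than a plain list, so the results may need reshaping before they are compared.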

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.