On Sun, Dec 4, 2022 at 12:50 PM Hervé Pagès <hpages.on.git...@gmail.com> wrote:
>
> On 03/12/2022 07:21, Bert Gunter wrote:
> > Perhaps it is worth pointing out that looping constructs like lapply() can
> > be avoided and the procedure vectorized by mimicking Martin Morgan's
> > solution:
> >
> > ## s is the string to be searched.
> > diff(c(0,grep('b',strsplit(s,'')[[1]])))
> >
> > However, Martin's solution is simpler and likely even faster as the regex
> > engine is unneeded:
> >
> > diff(c(0, which(strsplit(s, "")[[1]] == "b"))) ## completely vectorized
> >
> > This seems much preferable to me.
>
> Of all the proposed solutions, Andrew Hart's solution seems the most
> efficient:
>
>    big_string <- strrep("abaaabbaaaaabaaabaaaaaaaaaaaaaaaaaaab", 500000)
>
>    system.time(nchar(strsplit(big_string, split="b", fixed=TRUE)[[1]]) + 1)
>    #    user  system elapsed
>    #   0.736   0.028   0.764
>
>    system.time(diff(c(0, which(strsplit(big_string, "", fixed=TRUE)[[1]] == "b"))))
>    #    user  system elapsed
>    #   2.100   0.356   2.455
>
> The bigger the string, the bigger the gap in performance.
>
> Also, the bigger the average gap between 2 successive b's, the bigger
> the gap in performance.
>
> Finally: always use fixed=TRUE in strsplit() if you don't need to use
> the regex engine.

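For anyone following along, here is a minimal base R sketch of what the quoted approaches compute, run on a short string instead of the big benchmark string. The string s and the variable names are only illustrative assumptions, and the gregexpr() variant is an extra location-based alternative added for comparison, not something proposed in the quoted messages.

```r
# Short illustrative string; it ends in "b", like the benchmark pattern.
s <- "abaaabbaaaaabaaab"

# Position-based: split into single characters, find the b's, then take
# successive differences (prepending 0 so the first value counts from the
# start of the string).
chars <- strsplit(s, "", fixed = TRUE)[[1]]
gaps_which <- diff(c(0, which(chars == "b")))

# Split-based: split on "b" and measure each piece. Adding 1 accounts for
# the "b" that terminated the piece.
gaps_split <- nchar(strsplit(s, "b", fixed = TRUE)[[1]]) + 1

# Location-based without splitting into characters: gregexpr() returns the
# match positions directly (it returns -1 when there is no match at all).
gaps_loc <- diff(c(0, as.integer(gregexpr("b", s, fixed = TRUE)[[1]])))

print(gaps_which)  # 2 4 1 6 4
print(gaps_split)  # 2 4 1 6 4
print(gaps_loc)    # 2 4 1 6 4
```

One caveat: the split-based version only agrees with the other two when the string ends in "b". Otherwise strsplit() returns one extra trailing piece for the characters after the last "b".
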
You can do a bit better if you are willing to use stringr:

library(stringr)
big_string <- strrep("abaaabbaaaaabaaabaaaaaaaaaaaaaaaaaaab", 500000)

system.time(nchar(strsplit(big_string, split="b", fixed=TRUE)[[1]]) + 1)
#>    user  system elapsed
#>   0.126   0.002   0.128

system.time(str_length(str_split(big_string, fixed("b"))[[1]]))
#>    user  system elapsed
#>   0.103   0.004   0.107

(And my timings also suggest that it's time for Hervé to get a new computer :P)

It feels like an approach that uses locations should be faster since you
wouldn't have to construct all the intermediate strings.

system.time(pos <- str_locate_all(big_string, fixed("b"))[[1]][,1])
#>    user  system elapsed
#>   0.075   0.004   0.080

# I suspect this could be optimised with a little thought, making this approach
# faster overall
system.time(c(0, diff(pos)))
#>    user  system elapsed
#>   0.022   0.006   0.027

Hadley

-- 
http://hadley.nz

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.