Here is another variant, v3, and a change to your first example so it returns the same value as your second example.
> set.seed(1001)
> x <- sapply(1:100, function(x)paste0(sample(letters,rpois(1,1e5),rep=TRUE),collapse = ""))
> system.time(v1 <- lengths(strsplit(paste0("X", x, "X"),"a",fixed=TRUE)) - 1)
   user  system elapsed
   0.47    0.00    0.49
> system.time(v2 <- nchar(gsub("[^a]", "", x)))
   user  system elapsed
   2.53    0.00    2.53
> system.time(v3 <- nchar(x) - nchar(gsub("a", "", x, fixed=TRUE)))
   user  system elapsed
   0.08    0.00    0.08
>
> all.equal(v1,v2)
[1] TRUE
> all.equal(v1,v3)
[1] TRUE

Bill Dunlap
TIBCO Software
wdunlap tibco.com

On Mon, Nov 14, 2016 at 12:23 PM, Bert Gunter <bgunter.4...@gmail.com> wrote:

> Chuck, Marc, and anyone else who still has interest in this odd little
> discussion ...
>
> Yes, and with fixed = TRUE my approach took 1/3 as much time as
> Chuck's with a 10 element vector each element of which is a character
> string of length 1e5:
>
> > set.seed(1001)
> > x <- sapply(1:10, function(x)paste0(sample(letters,1e5,rep=TRUE),collapse = ""))
>
> > system.time(sum(lengths(strsplit(paste0("X", x, "X"),"a",fixed=TRUE)) - 1))
>    user  system elapsed
>   0.012   0.000   0.012
> > system.time(nchar(gsub("[^a]", "", x,fixed = TRUE)))
>    user  system elapsed
>   0.004   0.000   0.004
>
> Best,
> Bert
>
>
> Bert Gunter
>
> "The trouble with having an open mind is that people keep coming along
> and sticking things into it."
> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
>
>
> On Mon, Nov 14, 2016 at 11:55 AM, Charles C. Berry <ccbe...@ucsd.edu> wrote:
> > On Mon, 14 Nov 2016, Marc Schwartz wrote:
> >
> >>
> >>> On Nov 14, 2016, at 11:26 AM, Charles C. Berry <ccbe...@ucsd.edu> wrote:
> >>>
> >>> On Mon, 14 Nov 2016, Bert Gunter wrote:
> >>>
> > [stuff deleted]
> >
> >
> >> Hi,
> >>
> >> Both gsub() and strsplit() are using regex based pattern matching
> >> internally. That being said, they are ultimately calling .Internal code, so
> >> both are pretty fast.
> >>
> >> For comparison:
> >>
> >> ## Create a 1,000,000 character vector
> >> set.seed(1)
> >> Vec <- paste(sample(letters, 1000000, replace = TRUE), collapse = "")
> >>
> >>> nchar(Vec)
> >>
> >> [1] 1000000
> >>
> >> ## Split the vector into single characters and tabulate
> >>>
> >>> table(strsplit(Vec, split = "")[[1]])
> >>
> >>
> >>     a     b     c     d     e     f     g     h     i     j     k     l
> >> 38664 38442 38282 38496 38540 38623 38548 38288 38143 38493 38184 38621
> >>     m     n     o     p     q     r     s     t     u     v     w     x
> >> 38306 38725 38705 38144 38529 38809 38575 38355 38386 38364 38904 38310
> >>     y     z
> >> 38265 38299
> >>
> >>
> >> ## Get just the count of "a"
> >>>
> >>> table(strsplit(Vec, split = "")[[1]])["a"]
> >>
> >>     a
> >> 38664
> >>
> >>> nchar(gsub("[^a]", "", Vec))
> >>
> >> [1] 38664
> >>
> >>
> >> ## Check performance
> >>>
> >>> system.time(table(strsplit(Vec, split = "")[[1]])["a"])
> >>
> >>    user  system elapsed
> >>   0.100   0.007   0.107
> >>
> >>> system.time(nchar(gsub("[^a]", "", Vec)))
> >>
> >>    user  system elapsed
> >>   0.270   0.001   0.272
> >>
> >>
> >> So, the above would suggest that using strsplit() is somewhat faster than
> >> using gsub(). However, as Chuck notes, in the absence of more exhaustive
> >> benchmarking, the difference may or may not be more generalizable.
> >
> >
> >
> > Whether splitting on fixed strings rather than treating them as
> > regex'es (i.e.`fixed=TRUE') makes a big difference seems to depend on
> > what you split:
> >
> > First repeating what Marc did...
> >
> >> system.time(table(strsplit(Vec, split = "",fixed=TRUE)[[1]])["a"])
> >
> >    user  system elapsed
> >   0.132   0.010   0.139
> >>
> >> system.time(table(strsplit(Vec, split = "",fixed=FALSE)[[1]])["a"])
> >
> >    user  system elapsed
> >   0.130   0.010   0.138
> >
> > ... fixed=TRUE hardly matters. But the idiom I proposed...
> >
> >> system.time(sum(lengths(strsplit(paste0("X", Vec, "X"),"a",fixed=TRUE)) - 1))
> >
> >    user  system elapsed
> >   0.017   0.000   0.018
> >>
> >> system.time(sum(lengths(strsplit(paste0("X", Vec, "X"),"a",fixed=FALSE)) - 1))
> >
> >    user  system elapsed
> >   0.104   0.000   0.104
> >>
> >>
> >
> > ... is 5 times faster with fixed=TRUE for this case.
> >
> > This result matches Marc's count:
> >
> >> sum(lengths(strsplit(paste0("X", Vec, "X"),"a",fixed=FALSE)) - 1)
> >
> > [1] 38664
> >>
> >>
> >
> > Chuck
> > ______________________________________________
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
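
For anyone who wants to try the three idioms from this thread side by side, here is a minimal sketch that wraps each one in a function and checks them on a tiny input. The function names (count_split, count_regex, count_diff) are made up for illustration and do not come from any package. The sketch assumes ch is a single ordinary letter: count_split would miscount if ch were "X" (the padding character), and count_regex would misbehave if ch were a character with special meaning inside a regex bracket expression.

```r
# Hypothetical helper names; each wraps one idiom discussed in the thread.

# v1: pad with "X" so leading/trailing matches are kept, split on ch,
# and count the pieces minus one.
count_split <- function(x, ch) {
  lengths(strsplit(paste0("X", x, "X"), ch, fixed = TRUE)) - 1
}

# v2: delete every character that is not ch (regex), then count what is left.
count_regex <- function(x, ch) {
  nchar(gsub(paste0("[^", ch, "]"), "", x))
}

# v3: delete every ch (fixed match) and take the difference in length.
count_diff <- function(x, ch) {
  nchar(x) - nchar(gsub(ch, "", x, fixed = TRUE))
}

x <- c("banana", "", "aaa", "xyz")
count_split(x, "a")  # 3 0 3 0
count_regex(x, "a")  # 3 0 3 0
count_diff(x, "a")   # 3 0 3 0
```

All three are vectorized over x, so they can be timed with system.time() on the long strings used above. Only v2 needs the regex engine, which may be why it was the slowest in the timings reported here, though that is a guess rather than something measured.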