Some small points to add on this discussion:

> After investigating the source of table, I ended up on the reason being
> “as.character()”:

This is specifically happening within the conversion of the input to type factor, which is where the as.character conversion happens.

# Timing is all on my local machine (OSX)
N_v <- sample(c(1,0), 10^7, replace = TRUE)
L_v <- sample(c(TRUE, FALSE), 10^7, replace = TRUE)

#                                       user  system elapsed
system.time(table(N_v))              # 2.155   0.039   2.192
system.time(table(L_v))              # 0.806   0.030   0.838

system.time(N_fv <- as.factor(N_v))  # 2.026   0.024   2.050
system.time(L_fv <- as.factor(L_v))  # 0.668   0.015   0.683

system.time(table(N_fv))             # 0.133   0.022   0.156
system.time(table(L_fv))             # 0.134   0.018   0.151

> The performance for Integers and specially booleans is quite surprising.

Of note is that the performance is significantly better with `tabulate`, since this doesn't involve a conversion to factor (though input must be numeric/factor, only positive integers are counted, results aren't named, and it has worse handling of NA values). If you have performance-critical calls like this, you could consider using `tabulate` instead.

# note: tabulate() ignores values < 1, so these count only the 1/TRUE entries
#                                          user  system elapsed
system.time(tabulate(N_v))              # 0.054   0.002   0.056
system.time(tabulate(as.integer(L_v)))  # 0.052   0.002   0.055

I don't know if this is a known issue or not; most of my colleagues are aware of the slow-down and use `tabulate` when performance is required. My understanding was that the slower performance is a trade-off for more consistent behavior (better output, better handling of ambiguities/NA, etc.), and that speed isn't the highest priority with `table`. Maybe someone else has a better understanding of the history of the function.

As for improving the speed, it would basically come down to refactoring `table` to not use a `factor` conversion. I'd be concerned about introducing a lot of edge cases with that, but it's theoretically possible.

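To illustrate the `tabulate` caveats mentioned above, here is a minimal sketch on a toy vector (the names `x`, `lgl`, `counts`, and `lgl_counts` are made up for illustration and are not part of the benchmark). `tabulate` only bins positive integers, so zeros and NA values are silently dropped and the result carries no names. Shifting the values by one and naming the bins by hand is one possible workaround, assuming the set of categories is known in advance:

```r
# Toy input: two zeros, three ones, one NA (illustrative values only)
x <- c(0, 1, 1, 0, 1, NA)

# tabulate() bins positive integers only: the zeros and the NA are ignored
tabulate(x)

# Shift 0/1 to 1/2 so both categories land in a bin, then name the bins
counts <- tabulate(x + 1, nbins = 2)
names(counts) <- c("0", "1")
counts

# Same idea for a logical vector: FALSE -> bin 1, TRUE -> bin 2
lgl <- c(TRUE, FALSE, TRUE, TRUE)
lgl_counts <- tabulate(as.integer(lgl) + 1L, nbins = 2L)
names(lgl_counts) <- c("FALSE", "TRUE")
lgl_counts
```

This skips the factor conversion entirely, but it leaves naming and any NA counting to the caller, which is part of what `table` handles for you.
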
Based on 30 seconds of thinking, a non-factor implementation might look something like:

## just a sketch of a barebones non-factor implementation
test_tab <- function(x){
  lookup <- unique(x)                    # distinct values, in order of appearance
  counts <- tabulate(match(x, lookup))   # count by position in lookup
  names(counts) <- as.character(lookup)  # as.character on the unique values only
  counts
}

#                                 user  system elapsed
system.time(test_tab(L_v))     # 0.101   0.006   0.107
system.time(test_tab(N_v))     # 0.129   0.015   0.144

This is also faster in the case where there are lots of categories with few entries per category:

N_v2 <- 1:1e7
system.time(test_tab(N_v2))    # 0.383   0.024   0.411
system.time(table(N_v2))       # 6.122   0.228   6.398

Obviously there are some big shortcomings:
- it's missing a lot of error checking etc. that the standard `table` has
- it only works with 1D vectors
- NA handling isn't quite the same as `table` (though it would be easy to adapt)

Just including this to potentially start a discussion on optimization. For reference, the relevant section is in src/library/base/R/table.R:L75-85

-Aidan

-----------------------
Aidan Lakshman (he/him)
http://www.ahl27.com/

On 21 Mar 2025, at 8:26, Karolis Koncevičius wrote:

> I was calling table() on some long logical vectors and noticed that it took a
> long time.
>
> Out of curiosity I checked the performance of table() on different types, and
> had some unexpected results:
>
> C <- sample(c("yes", "no"), 10^7, replace = TRUE)
> F <- factor(sample(c("yes", "no"), 10^7, replace = TRUE))
> N <- sample(c(1,0), 10^7, replace = TRUE)
> I <- sample(c(1L,0L), 10^7, replace = TRUE)
> L <- sample(c(TRUE, FALSE), 10^7, replace = TRUE)
>
> # ordered by execution time
> #                        user  system elapsed
> system.time(table(F)) # 0.088   0.006   0.093
> system.time(table(C)) # 0.208   0.017   0.224
> system.time(table(I)) # 0.242   0.019   0.261
> system.time(table(L)) # 0.665   0.015   0.680
> system.time(table(N)) # 1.771   0.019   1.791
>
>
> The performance for Integers and specially booleans is quite surprising.
> After investigating the source of table, I ended up on the reason being
> “as.character()”:
>
> system.time(as.character(L))
>    user  system elapsed
>   0.461   0.002   0.462
>
> Even a manual conversion can achieve a speed-up by a factor of ~7:
>
> system.time(c("FALSE", "TRUE")[L+1])
>    user  system elapsed
>   0.061   0.006   0.067
>
>
> Tested on 4.4.3 as well as devel trunk.
>
> Just reporting for comments and attention.
> Karolis K.
> ______________________________________________
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel