On 21.03.25 at 15:42, Aidan Lakshman via R-devel wrote:
After investigating the source of table, I ended up finding the reason to be “as.character()”:

This happens specifically during the conversion of the input to type factor, which is where the as.character() call occurs.

Yes, I also think 'factor' could do a bit better for unclassed integers (such as when called from 'cut') as well as for logical input (such as from 'summary' -> 'table').

Note that 'as.factor' already has a "fast track" for plain integers (originally for 'split.default' from 'tapply'), so can be used instead of 'factor' when there is no need for custom 'levels', 'labels', or 'exclude'. (Thanks for already mentioning 'tabulate'.)
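
For illustration (not from the thread; 'x' and the sample data are just an example), the difference can be seen directly when no custom 'levels', 'labels', or 'exclude' are needed:

    ## plain integer vector, no custom 'levels'/'labels'/'exclude' needed
    x <- sample(0:1, 10^7, replace = TRUE)
    system.time(as.factor(x))   # integer fast track, no as.character() on x
    system.time(factor(x))      # currently coerces x via as.character()
    identical(as.factor(x), factor(x))  # same result for this input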

A 'factor' patch would apply more broadly, e.g.:

===================================================================
--- src/library/base/R/factor.R (Revision 88042)
+++ src/library/base/R/factor.R (working copy)
@@ -20,14 +20,18 @@
                    exclude = NA, ordered = is.ordered(x), nmax = NA)
 {
     if(is.null(x)) x <- character()
+    directmatch <- !is.object(x) &&
+        (is.character(x) || is.integer(x) || is.logical(x))
     nx <- names(x)
     if (missing(levels)) {
        y <- unique(x, nmax = nmax)
        ind <- order(y)
-       levels <- unique(as.character(y)[ind])
+        if (!directmatch)
+            y <- as.character(y)
+       levels <- unique(y[ind])
     }
     force(ordered) # check if original x is an ordered factor
-    if(!is.character(x))
+    if(!directmatch)
        x <- as.character(x)
     ## levels could be a long vector, but match will not handle that.
     levels <- levels[is.na(match(levels, exclude))]
     f <- match(x, levels)
===================================================================

This skips as.character() for integer/logical 'x' as well and would indeed bring table() runtimes "in order":

    set.seed(1)
    C <- sample(c("no", "yes"), 10^7, replace = TRUE)
    F <- as.factor(C)
    L <- F == "yes"
    I <- as.integer(L)
    N <- as.numeric(I)

    ## Median system.time(table(.)) in ms:
    ## table(F)   256
    ## table(I)   384   # not  696
    ## table(L)   409   # not 1159
    ## table(C)   591
    ## table(N)  3324
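
As an illustrative (not exhaustive) sanity check with the objects above, the following comparisons should hold both with and without the patch, since the level ordering is determined on the original 'y' before the as.character() step:

    stopifnot(identical(factor(I), as.factor(I)),
              identical(factor(L), factor(as.character(L))),
              identical(levels(factor(I)), c("0", "1")))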

The (seemingly) small patch passes check-all, but maybe it overlooks some edge cases. I'd test it on a subset of CRAN/BIOC packages.

Best,

        Sebastian Meyer


   # Timing is all on my local machine (OSX)
   N_v <- sample(c(1,0), 10^7, replace = TRUE)
   L_v <- sample(c(TRUE, FALSE), 10^7, replace = TRUE)
                                          #  user  system elapsed
   system.time(table(N_v))                # 2.155   0.039   2.192
   system.time(table(L_v))                # 0.806   0.030   0.838

   system.time(N_fv <- as.factor(N_v))    # 2.026   0.024   2.050
   system.time(L_fv <- as.factor(L_v))    # 0.668   0.015   0.683

   system.time(table(N_fv))               # 0.133   0.022   0.156
   system.time(table(L_fv))               # 0.134   0.018   0.151

The performance for integers and especially booleans is quite surprising.

Of note, the performance is significantly better when using `tabulate`, since it 
doesn't involve a conversion to factor (though the input must be numeric/factor, 
the results aren't named, and it has worse handling of NA values). If you have 
performance-critical calls like this, you could consider using `tabulate` instead.

   system.time(tabulate(N_v))             # 0.054   0.002   0.056
   system.time(tabulate(as.integer(L_v))) # 0.052   0.002   0.055
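
For example, one way to keep names when counting a logical vector with `tabulate` (just a sketch; `lcount` is only an example name, and it assumes `L_v` contains no NA values):

   lcount <- tabulate(L_v + 1L, nbins = 2L)  # FALSE -> bin 1, TRUE -> bin 2
   names(lcount) <- c("FALSE", "TRUE")
   lcount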


I don't know if this is a known issue or not; most of my colleagues are aware 
of the slow-down and use `tabulate` when performance is required. My 
understanding was that the slower performance is a trade-off for more 
consistent behavior (better output, better handling of ambiguities/NA, 
etc.), and that speed isn't the highest priority for `table`. Maybe someone 
else has a better understanding of the history of the function.

As for improving the speed, it would basically come down to refactoring `table` 
to not use a `factor` conversion. I'd be concerned about introducing a lot of 
edge cases with that, but it's theoretically possible. Based on 30 seconds of 
thinking, it may be possible to do something like:

## just a sketch of a barebones non-factor implementation
   test_tab <- function(x){
     lookup <- unique(x)                   # distinct values, in order of first appearance
     counts <- tabulate(match(x, lookup))  # count occurrences of each distinct value
     names(counts) <- as.character(lookup)
     counts
   }

   system.time(test_tab(L_v))  # 0.101   0.006   0.107
   system.time(test_tab(N_v))  # 0.129   0.015   0.144

This is also faster in the case where there are lots of categories with few 
entries per category:

   N_v2 <- 1:1e7
   system.time(test_tab(N_v2)) # 0.383   0.024   0.411
   system.time(table(N_v2))    # 6.122   0.228   6.398

Obviously there are some big shortcomings:
- it's missing a lot of error checking etc. that the standard `table` has
- it only works with 1D vectors
- NA handling isn't quite the same as `table` (though it would be easy to adapt; one possible adaptation is sketched below)
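
For the last point, a minimal sketch of one possible NA adaptation (the function name, the `useNA` argument, and the category ordering are illustrative choices here, not `table()`'s exact semantics):

   test_tab_na <- function(x, useNA = c("no", "ifany")){
     useNA <- match.arg(useNA)
     if (useNA == "no") x <- x[!is.na(x)]
     lookup <- sort(unique(x), na.last = TRUE)   # NA (if kept) becomes the last category
     counts <- tabulate(match(x, lookup), nbins = length(lookup))
     names(counts) <- as.character(lookup)       # an NA category keeps an NA name
     counts
   }

   test_tab_na(c(L_v, NA), useNA = "ifany")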

Just including this to potentially start a discussion about optimization.

For reference, the relevant section is in src/library/base/R/table.R:L75-85

-Aidan

-----------------------
Aidan Lakshman (he/him)
http://www.ahl27.com/

On 21 Mar 2025, at 8:26, Karolis Koncevičius wrote:

I was calling table() on some long logical vectors and noticed that it took a 
long time.

Out of curiosity I checked the performance of table() on different types, and 
had some unexpected results:

     C <- sample(c("yes", "no"), 10^7, replace = TRUE)
     F <- factor(sample(c("yes", "no"), 10^7, replace = TRUE))
     N <- sample(c(1,0), 10^7, replace = TRUE)
     I <- sample(c(1L,0L), 10^7, replace = TRUE)
     L <- sample(c(TRUE, FALSE), 10^7, replace = TRUE)

                            # ordered by execution time
                            #   user  system elapsed
     system.time(table(F))  #  0.088   0.006   0.093
     system.time(table(C))  #  0.208   0.017   0.224
     system.time(table(I))  #  0.242   0.019   0.261
     system.time(table(L))  #  0.665   0.015   0.680
     system.time(table(N))  #  1.771   0.019   1.791


The performance for integers and especially booleans is quite surprising.
After investigating the source of table, I ended up finding the reason to be 
“as.character()”:

     system.time(as.character(L))
      user  system elapsed
     0.461   0.002   0.462

Even a manual conversion can achieve a speed-up by a factor of ~7:

     system.time(c("FALSE", "TRUE")[L+1])
      user  system elapsed
     0.061   0.006   0.067
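
As a quick check (illustrative), this manual mapping agrees with as.character() for logical input, since indexing with an NA gives an NA string:

     identical(c("FALSE", "TRUE")[L + 1], as.character(L))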


Tested on R 4.4.3 as well as the devel trunk.

Just reporting for comments and attention.
Karolis K.
______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel