[Rd] factor() calls sort.list unnecessarily?

Martin Morgan Fri, 03 Jul 2009 08:13:51 -0700

R-devel,

factor(x) can take a long time on large character vectors (more than a
minute in the example below). Profiling shows that almost all of that
time is spent in a call to sort.list.


> str(x)
 chr [1:3436831] "chr5" "chr10" "chr16" "chr3" "chr4" "chr15" ...
> Rprof("/tmp/factor.Rprof")
> invisible(factor(x))
> Rprof()
> summaryRprof("/tmp/factor.Rprof")
$by.self
                 self.time self.pct total.time total.pct
"sort.list"          66.14     98.9      66.14      98.9
"unique.default"      0.26      0.4       0.26       0.4
"unique"              0.24      0.4       0.50       0.7
"match"               0.24      0.4       0.24       0.4
"factor"              0.02      0.0      66.90     100.0

$by.total
                 total.time total.pct self.time self.pct
"factor"              66.90     100.0      0.02      0.0
"sort.list"           66.14      98.9     66.14     98.9
"unique"               0.50       0.7      0.24      0.4
"unique.default"       0.26       0.4      0.26      0.4
"match"                0.24       0.4      0.24      0.4

$sampling.time
[1] 66.9

sort.list is always called, but its result is used only to determine the
order of the levels, so the call is unnecessary when levels are provided.
In addition, the order of the levels depends only on the unique values of
x, so only those need to be sorted. Perhaps these issues are addressed in
the patch below? It does require calling unique() on the original
argument x, rather than only on as.character(x). At the least, perhaps
sort.list can be called only when levels are not provided?

Martin

Index: src/library/base/R/factor.R
===================================================================
--- src/library/base/R/factor.R (revision 48892)
+++ src/library/base/R/factor.R (working copy)
@@ -18,12 +18,13 @@
                    exclude = NA, ordered = is.ordered(x))
 {
     exclude <- as.vector(exclude, typeof(x))
-    ind <- sort.list(x) # or ?  order(x) which more (too ?) tolerant
+    if (missing(levels))
+        ind <- sort.list(unique(x))
     nx <- names(x)
     force(ordered)
     x <- as.character(x)
     if(missing(levels)) # get unique levels ordered by the original values
-       levels <- unique(x[ind])
+       levels <- unique(x)[ind]
     levels <- levels[is.na(match(levels, exclude))]
     f <- match(x, levels)
     if(!is.null(nx))

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
