Hi Matt,
I think there are some confusing factors in your results.
system.time(strcmp(strings[-1], strings[-1e5]))
would also include the time required to perform both subscripting
(strings[-1] and strings[-1e5] ) which actually "takes some time".
Also, you do have a bit of overhead due to the use of STRING_ELT and the
write barrier.
I've include below a version that uses R internals so that you get the
fast (but you have to understand the risks, etc ...) version of
STRING_ELT using the plugin system of inline.
library(inline)
code <- "
SEXP ans;
int i, len, *cans;
if(!isString(s1) || !isString(s2))
error(\"invalid arguments\");
len = length(s1)>length(s2)?length(s2):length(s1);
PROTECT(ans = allocVector(INTSXP, len));
cans = INTEGER(ans);
for(i = 0; i < len; i++)
cans[i] = strcmp(CHAR(STRING_ELT(s1,i)),\
CHAR(STRING_ELT(s2,i)));
UNPROTECT(1);
return ans;
"
sig <- signature(s1="character", s2="character")
strcmp <- cfunction(sig, code)
strings <- replicate(1e5, paste(sample(letters, 100, rep = T), collapse
= ""))
lhs <- strings[-1]
rhs <- strings[-1e5]
system.time( lhs == rhs )
system.time(strcmp( lhs, rhs) )
library(inline)
settings <- getPlugin( "default" )
settings$includes <- paste( "#define USE_RINTERNALS", settings$includes,
collapse = "\n" )
code2 <- "
SEXP ans;
int i, len, *cans;
if(!isString(s1) || !isString(s2))
error(\"invalid arguments\");
len = length(s1)>length(s2)?length(s2):length(s1);
PROTECT(ans = allocVector(INTSXP, len));
cans = INTEGER(ans);
for(i = 0; i < len; i++)
cans[i] = strcmp(CHAR(STRING_ELT(s1,i)),\
CHAR(STRING_ELT(s2,i)));
UNPROTECT(1);
return ans;
"
sig <- signature(s1="character", s2="character" )
strcmp2 <- cxxfunction(sig, code2, settings = settings)
system.time(strcmp2( lhs, rhs) )
I get:
$ Rscript strings.R
Le chargement a nécessité le package : methods
utilisateur système écoulé
0.002 0.000 0.002
utilisateur système écoulé
0.004 0.000 0.005
utilisateur système écoulé
0.003 0.000 0.003
Romain
Le 13/07/10 15:24, Matt Shotwell a écrit :
On Tue, 2010-07-13 at 01:42 -0400, Hadley Wickham wrote:
strings<- replicate(1e5, paste(sample(letters, 100, rep = T), collapse = ""))
system.time(strings[-1] == strings[-1e5])
# user system elapsed
# 0.016 0.000 0.017
So it takes ~1/100 of a second to do ~100,000 string comparisons. You
need to provide a reproducible example that illustrates why you think
string comparisons are slow.
Here's a vectorized alternative to '==' for strings, with minimal
argument checking or result conversion. I haven't looked at the
corresponding R source code, it may be similar:
library(inline)
code<- "
SEXP ans;
int i, len, *cans;
if(!isString(s1) || !isString(s2))
error(\"invalid arguments\");
len = length(s1)>length(s2)?length(s2):length(s1);
PROTECT(ans = allocVector(INTSXP, len));
cans = INTEGER(ans);
for(i = 0; i< len; i++)
cans[i] = strcmp(CHAR(STRING_ELT(s1,i)),\
CHAR(STRING_ELT(s2,i)));
UNPROTECT(1);
return ans;
"
sig<- signature(s1="character", s2="character")
strcmp<- cfunction(sig, code)
system.time(strings[-1] == strings[-1e5])
user system elapsed
0.036 0.000 0.035
system.time(strcmp(strings[-1], strings[-1e5]))
user system elapsed
0.032 0.000 0.034
That's pretty fast, though I seem to be working with a slower system
than Hadley. It's hard to see how this could be improved, except maybe
by caching results of string comparisons.
-Matt
Hadley
On Tue, Jul 13, 2010 at 6:52 AM, Ralf B<ralf.bie...@gmail.com> wrote:
I am asking this question because String comparison in R seems to be
awfully slow (based on profiling results) and I wonder if perhaps '=='
alone is not the best one can do. I did not ask for anything
particular and I don't think I need to provide a self-contained source
example for the question. So, to re-phrase my question, are there more
(runtime) effective ways to find out if two strings (about 100-150
characters long) are equal?
Ralf
On Sun, Jul 11, 2010 at 2:37 PM, Sharpie<ch...@sharpsteen.net> wrote:
Ralf B wrote:
What is the fastest way to compare two strings in R?
Ralf
Which way is not fast enough?
In other words, are you asking this question because profiling showed one of
R's string comparison operations is causing a massive bottleneck in your
code? If so, which one and how are you using it?
-Charlie
--
Romain Francois
Professional R Enthusiast
+33(0) 6 28 91 30 30
http://romainfrancois.blog.free.fr
|- http://bit.ly/bc8jNi : Rcpp 0.8.4
|- http://bit.ly/dz0RlX : bibtex 0.2-1
`- http://bit.ly/a5CK2h : Les estivales 2010
______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.