Good idea Romain, there is quite a bit of type testing in the function versions of STRING_ELT and CHAR, not to mention the function call overhead. Since the types are checked explicitly, I believe this function is safe. All together now...
> system.time(strings[-1] == strings[-1e5]) user system elapsed 0.032 0.000 0.035 > system.time(strcmp(strings[-1], strings[-1e5])) user system elapsed 0.032 0.000 0.034 > system.time(strcmp2(strings[-1], strings[-1e5])) user system elapsed 0.024 0.000 0.026 > system.time(lhs==rhs) user system elapsed 0.012 0.000 0.013 > system.time(strcmp(lhs, rhs)) user system elapsed 0.012 0.000 0.011 > system.time(strcmp2(lhs, rhs)) user system elapsed 0.004 0.000 0.004 I looks like you can squeeze out more speed using the macro versions of STRING_ELT and CHAR. On Tue, 2010-07-13 at 09:48 -0400, Romain Francois wrote: > Hi Matt, > > I think there are some confusing factors in your results. > > > system.time(strcmp(strings[-1], strings[-1e5])) > > would also include the time required to perform both subscripting > (strings[-1] and strings[-1e5] ) which actually "takes some time". > > > Also, you do have a bit of overhead due to the use of STRING_ELT and the > write barrier. > > > I've include below a version that uses R internals so that you get the > fast (but you have to understand the risks, etc ...) version of > STRING_ELT using the plugin system of inline. > > library(inline) > code <- " > SEXP ans; > int i, len, *cans; > if(!isString(s1) || !isString(s2)) > error(\"invalid arguments\"); > len = length(s1)>length(s2)?length(s2):length(s1); > PROTECT(ans = allocVector(INTSXP, len)); > cans = INTEGER(ans); > for(i = 0; i < len; i++) > cans[i] = strcmp(CHAR(STRING_ELT(s1,i)),\ > CHAR(STRING_ELT(s2,i))); > UNPROTECT(1); > return ans; > " > sig <- signature(s1="character", s2="character") > strcmp <- cfunction(sig, code) > > strings <- replicate(1e5, paste(sample(letters, 100, rep = T), collapse > = "")) > > > lhs <- strings[-1] > rhs <- strings[-1e5] > system.time( lhs == rhs ) > system.time(strcmp( lhs, rhs) ) > > library(inline) > settings <- getPlugin( "default" ) > settings$includes <- paste( "#define USE_RINTERNALS", settings$includes, > collapse = "\n" ) > code2 <- " > SEXP ans; > int i, len, *cans; > if(!isString(s1) || !isString(s2)) > error(\"invalid arguments\"); > len = length(s1)>length(s2)?length(s2):length(s1); > PROTECT(ans = allocVector(INTSXP, len)); > cans = INTEGER(ans); > for(i = 0; i < len; i++) > cans[i] = strcmp(CHAR(STRING_ELT(s1,i)),\ > CHAR(STRING_ELT(s2,i))); > UNPROTECT(1); > return ans; > " > sig <- signature(s1="character", s2="character" ) > strcmp2 <- cxxfunction(sig, code2, settings = settings) > system.time(strcmp2( lhs, rhs) ) > > > > I get: > > $ Rscript strings.R > Le chargement a nécessité le package : methods > utilisateur système écoulé > 0.002 0.000 0.002 > utilisateur système écoulé > 0.004 0.000 0.005 > utilisateur système écoulé > 0.003 0.000 0.003 > > Romain > > > Le 13/07/10 15:24, Matt Shotwell a écrit : > > > > On Tue, 2010-07-13 at 01:42 -0400, Hadley Wickham wrote: > >> strings<- replicate(1e5, paste(sample(letters, 100, rep = T), collapse = > >> "")) > >> system.time(strings[-1] == strings[-1e5]) > >> # user system elapsed > >> # 0.016 0.000 0.017 > >> > >> So it takes ~1/100 of a second to do ~100,000 string comparisons. You > >> need to provide a reproducible example that illustrates why you think > >> string comparisons are slow. > > > > Here's a vectorized alternative to '==' for strings, with minimal > > argument checking or result conversion. I haven't looked at the > > corresponding R source code, it may be similar: > > > > library(inline) > > code<- " > > SEXP ans; > > int i, len, *cans; > > if(!isString(s1) || !isString(s2)) > > error(\"invalid arguments\"); > > len = length(s1)>length(s2)?length(s2):length(s1); > > PROTECT(ans = allocVector(INTSXP, len)); > > cans = INTEGER(ans); > > for(i = 0; i< len; i++) > > cans[i] = strcmp(CHAR(STRING_ELT(s1,i)),\ > > CHAR(STRING_ELT(s2,i))); > > UNPROTECT(1); > > return ans; > > " > > sig<- signature(s1="character", s2="character") > > strcmp<- cfunction(sig, code) > > > > > >> system.time(strings[-1] == strings[-1e5]) > > user system elapsed > > 0.036 0.000 0.035 > >> system.time(strcmp(strings[-1], strings[-1e5])) > > user system elapsed > > 0.032 0.000 0.034 > > > > That's pretty fast, though I seem to be working with a slower system > > than Hadley. It's hard to see how this could be improved, except maybe > > by caching results of string comparisons. > > > > -Matt > > > >> > >> Hadley > >> > >> > >> On Tue, Jul 13, 2010 at 6:52 AM, Ralf B<ralf.bie...@gmail.com> wrote: > >>> I am asking this question because String comparison in R seems to be > >>> awfully slow (based on profiling results) and I wonder if perhaps '==' > >>> alone is not the best one can do. I did not ask for anything > >>> particular and I don't think I need to provide a self-contained source > >>> example for the question. So, to re-phrase my question, are there more > >>> (runtime) effective ways to find out if two strings (about 100-150 > >>> characters long) are equal? > >>> > >>> Ralf > >>> > >>> > >>> > >>> > >>> > >>> > >>> On Sun, Jul 11, 2010 at 2:37 PM, Sharpie<ch...@sharpsteen.net> wrote: > >>>> > >>>> > >>>> Ralf B wrote: > >>>>> > >>>>> What is the fastest way to compare two strings in R? > >>>>> > >>>>> Ralf > >>>>> > >>>> > >>>> Which way is not fast enough? > >>>> > >>>> In other words, are you asking this question because profiling showed > >>>> one of > >>>> R's string comparison operations is causing a massive bottleneck in your > >>>> code? If so, which one and how are you using it? > >>>> > >>>> -Charlie > > -- Matthew S. Shotwell Graduate Student Division of Biostatistics and Epidemiology Medical University of South Carolina http://biostatmatt.com ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.