Re: [R] what is the faster way to search for a pattern in a few million entries data frame ?

Martin Morgan Sun, 10 Apr 2016 16:34:18 -0700


On 04/10/2016 03:27 PM, Fabien Tarrade wrote:

Hi Duncan,

Didn't you post the same question yesterday?  Perhaps nobody answered
because your question is unanswerable.

sorry, I got an email saying that my message was waiting for approval,
and when I looked at the forum I didn't see my message, which is why I
sent it again. This time I checked that the format of my message was
text only. Sorry for the noise.

You need to describe what the strings are like and what the patterns
are like if you want advice on speeding things up.

my strings are 1-grams up to 5-grams (sequences of 1 word up to 5 words)
and I am searching for the frequency in my DF of the strings starting
with a given sequence of a few words.

I guess these days it is standard to use DFs with millions of entries, so
I was wondering how people do that in the fastest way.


I did this to generate and search 40 million unique strings

> grams <- as.character(1:4e7)        ## a long time passes...
> system.time(grep("^900001", grams)) ## similar times to grepl
   user  system elapsed
 10.384   0.168  10.543

Is that the basic task you're trying to accomplish? grep(l) goes quickly
to C, so I don't think data.table or other will be markedly faster if
you're looking for an arbitrary regular expression (use fixed=TRUE if
looking for an exact match).
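
To illustrate the fixed=TRUE remark, here is a small sketch on a made-up vector (x is purely illustrative, not the poster's data). With fixed=TRUE the pattern is taken literally, so "." is no longer a wildcard; note that "^" would also be literal, so fixed=TRUE cannot express "starts with".

```r
## Hypothetical toy vector, for illustration only
x <- c("a.b", "axb", "xa.b")

## Regular expression: "." matches any single character
re_hits <- grepl("a.b", x)

## Literal match: "." must be an actual dot
fixed_hits <- grepl("a.b", x, fixed = TRUE)

print(re_hits)
print(fixed_hits)
```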

If you're looking for strings that start with a pattern, then in R-3.3.0
there is


> system.time(res0 <- startsWith(grams, "900001"))
   user  system elapsed
  0.658   0.012   0.669

which returns the same result as grepl

> identical(res0, res1 <- grepl("^900001", grams))
[1] TRUE
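
The same comparison can be checked on a small vector. This is a sketch only: the names small and prefix are made up here, and the substr() line is just one possible fallback for R versions before 3.3.0, not something suggested elsewhere in this thread. The grepl form assumes the prefix contains no regex metacharacters.

```r
## Small, made-up stand-in for the 40 million strings above
small  <- as.character(1:2000)
prefix <- "90"

res_sw  <- startsWith(small, prefix)                   # R >= 3.3.0
res_re  <- grepl(paste0("^", prefix), small)           # anchored regular expression
res_sub <- substr(small, 1, nchar(prefix)) == prefix   # possible fallback for older R

print(identical(res_sw, res_re))
print(identical(res_sw, res_sub))
print(sum(res_sw))                                     # number of strings with the prefix
```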

One can also parallelize the already vectorized grepl function with
parallel::pvec, with some opportunity for gain (compared to grepl) on
non-Windows

> system.time(res2 <- parallel::pvec(seq_along(grams), function(i) grepl("^900001", grams[i]), mc.cores=8))
   user  system elapsed
 24.996   1.709   3.974
> identical(res0, res2)
[1] TRUE

I think anything else would require pre-processing of some kind, and
then some more detail about what your data looks like is required.
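
As one possible shape for such pre-processing, here is a minimal sketch that assumes the strings are space-separated n-grams and that a prefix should match whole words only. Both are assumptions, since the actual data is not shown; ngrams, index and count_prefix are hypothetical names. The index is built once, and each lookup then only scans the strings sharing the prefix's first word.

```r
## Made-up n-grams standing in for the real data
ngrams <- c("the cat", "the cat sat", "the dog", "a cat", "a cat sat on", "the")

## One-time pre-processing: group positions by first word
first_word <- sub(" .*$", "", ngrams)
index <- split(seq_along(ngrams), first_word)

## Count the n-grams that equal the prefix or continue it with another word
count_prefix <- function(prefix) {
    key <- sub(" .*$", "", prefix)
    candidates <- ngrams[index[[key]]]      # empty if the first word is unknown
    sum(candidates == prefix | startsWith(candidates, paste0(prefix, " ")))
}

print(count_prefix("the cat"))
print(count_prefix("the"))
print(count_prefix("zebra"))
```

Whether this pays off depends on how many lookups are done and how evenly the first words are distributed; for a single search, startsWith on the whole vector is likely simpler.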


Martin Morgan


Thanks
Cheers
Fabien



