Re: What to learn from the BSD grep case [Was: why GNU grep is fast]

2010-08-23 Thread C. P. Ghost
On Mon, Aug 23, 2010 at 5:04 PM, Gabor Kovesdan wrote: > 4, We really need a good regex library. From the comments, it seems there's > no such in the open source world. GNU libregex isn't efficient because GNU > grep uses those workarounds that Mike kindly pointed out. Oniguruma was > extremely sl

Re: What to learn from the BSD grep case [Was: why GNU grep is fast]

2010-08-23 Thread Scott Long
On Aug 23, 2010, at 9:04 AM, Gabor Kovesdan wrote: > Hi all, > > there are some consequences that we can see from the grep case. Here I'd like > to add a summary, which raises some questions. All comments are welcome. > > 1, When grep entered -CURRENT and bugs were found I immediately got kind

What to learn from the BSD grep case [Was: why GNU grep is fast]

2010-08-23 Thread Gabor Kovesdan
Hi all, there are some consequences that we can see from the grep case. Here I'd like to add a summary, which raises some questions. All comments are welcome. 1, When grep entered -CURRENT and bugs were found I immediately got kind bug reports and sharp criticism, as well. According to my u

Re: why GNU grep is fast

2010-08-23 Thread Gabor Kovesdan
Later on, he summarizes some of the existing implementations, including comments about the Plan 9 implementation and his own RE2, both of which efficiently handle international text (which seems to be a major concern of Gabor's). I believe Gabor is considering TRE for a good replacement re

Re: why GNU grep is fast

2010-08-23 Thread Gabor Kovesdan
Em 2010.08.21. 4:31, Mike Haertel escreveu: Hi Gabor, I am the original author of GNU grep. I am also a FreeBSD user, although I live on -stable (and older) and rarely pay attention to -current. However, while searching the -current mailing list for an unrelated reason, I stumbled across some

Re: why GNU grep is fast

2010-08-23 Thread Dag-Erling Smørgrav
Mike Haertel writes: > Dag-Erling Smørgrav writes: > > You don't really need to "isolate the containing line" unless you have > > an actual match, do you? > Theoretically no. However, suppose the pattern was /foo.*blah/. > The Boyer-Moore search will be for "blah", since that's the longest > fix

Re: why GNU grep is fast

2010-08-23 Thread Dag-Erling Smørgrav
"Sean C. Farley" writes: > Dag-Erling Smørgrav writes: > > Aho-Corasick is not really a search algorithm, but an algorithm for > > constructing a table-driven finite state machine that will match > > either of the search strings you fed it. I believe it is less > > efficient than Boyer-Moore for

Re: why GNU grep is fast

2010-08-22 Thread Sean C. Farley
On Sun, 22 Aug 2010, Tim Kientzle wrote: On Aug 22, 2010, at 9:30 AM, Sean C. Farley wrote: On Sun, 22 Aug 2010, Dag-Erling Smørgrav wrote: Mike Haertel writes: GNU grep uses the well-known Boyer-Moore algorithm, which looks first for the final letter of the target string, and uses a lookup

Re: why GNU grep is fast

2010-08-22 Thread Sean C. Farley
On Sun, 22 Aug 2010, Dag-Erling Smørgrav wrote: "Sean C. Farley" writes: Some algorithms: 1. http://en.wikipedia.org/wiki/Aho-Corasick_string_matching_algorithm Aho-Corasick is not really a search algorithm, but an algorithm for constructing a table-driven finite state machine that will mat

Re: why GNU grep is fast

2010-08-22 Thread Mike Haertel
Dag-Erling Smørgrav writes: > Mike Haertel writes: > > For example if your regex is /foo.*bar/, the initial Boyer-Moore search > > is (probably) searching for foo. > > > > If the initial search succeeds, GNU grep isolates the containing line, > > and then runs the full regex matcher on that line

Re: why GNU grep is fast

2010-08-22 Thread Dag-Erling Smørgrav
Mike Haertel writes: > For example if your regex is /foo.*bar/, the initial Boyer-Moore search > is (probably) searching for foo. > > If the initial search succeeds, GNU grep isolates the containing line, > and then runs the full regex matcher on that line to make sure. You don't really need to "

Re: why GNU grep is fast

2010-08-22 Thread Dag-Erling Smørgrav
"Sean C. Farley" writes: > Some algorithms: > 1. http://en.wikipedia.org/wiki/Aho-Corasick_string_matching_algorithm Aho-Corasick is not really a search algorithm, but an algorithm for constructing a table-driven finite state machine that will match either of the search strings you fed it. I bel

Re: why GNU grep is fast

2010-08-22 Thread Mike Haertel
Dag-Erling Sm�rgrav writes: > Mike Haertel writes: > > GNU grep uses the well-known Boyer-Moore algorithm, which looks > > first for the final letter of the target string, and uses a lookup > > table to tell it how far ahead it can skip in the input whenever > > it finds a non-matching character.

Re: why GNU grep is fast

2010-08-22 Thread Tim Kientzle
On Aug 22, 2010, at 9:30 AM, Sean C. Farley wrote: > On Sun, 22 Aug 2010, Dag-Erling Smørgrav wrote: >> Mike Haertel writes: >>> GNU grep uses the well-known Boyer-Moore algorithm, which looks first for >>> the final letter of the target string, and uses a lookup table to tell it >>> how far ahe

Re: why GNU grep is fast

2010-08-22 Thread Garrett Wollman
In article <20100822163644.gu2...@hoeg.nl> you write: >I think that implementing a simple fgrep boils down to mmap()ing a file >and calling memmem() on the mapping to search for the input string. Of >course this relies on having an efficient memmem() implementation, for >example using one of the a

Re: why GNU grep is fast

2010-08-22 Thread Tim Kientzle
On Aug 22, 2010, at 8:02 AM, Garrett Wollman wrote: > In article <86k4nikglg@ds4.des.no> you write: >> Mike Haertel writes: >>> GNU grep uses the well-known Boyer-Moore algorithm, which looks >>> first for the final letter of the target string, and uses a lookup >>> table to tell it how far ah

Re: why GNU grep is fast

2010-08-22 Thread Ed Schouten
* Mike Haertel wrote: > Moreover, GNU grep AVOIDS BREAKING THE INPUT INTO LINES. Looking > for newlines would slow grep down by a factor of several times, > because to find the newlines it would have to look at every byte! I think that implementing a simple fgrep boils down to mmap()ing a file a

Re: why GNU grep is fast

2010-08-22 Thread Sean C. Farley
On Sun, 22 Aug 2010, Dag-Erling Smørgrav wrote: Mike Haertel writes: GNU grep uses the well-known Boyer-Moore algorithm, which looks first for the final letter of the target string, and uses a lookup table to tell it how far ahead it can skip in the input whenever it finds a non-matching cha

Re: why GNU grep is fast

2010-08-22 Thread Xin LI
2010/8/22 Dag-Erling Smørgrav : > Amen.  The current bottleneck in BSD grep is the memchr() that looks for > '\n' in the input buffer. FYI I actually have a rewritten memchr() which is faster than the current one here: http://people.freebsd.org/~delphij/for_review/memchr.c Review/comments welcom

Re: why GNU grep is fast

2010-08-22 Thread Garrett Wollman
In article <86k4nikglg@ds4.des.no> you write: >Mike Haertel writes: >> GNU grep uses the well-known Boyer-Moore algorithm, which looks >> first for the final letter of the target string, and uses a lookup >> table to tell it how far ahead it can skip in the input whenever >> it finds a non-mat

Re: why GNU grep is fast

2010-08-22 Thread Dag-Erling Smørgrav
Mike Haertel writes: > GNU grep uses the well-known Boyer-Moore algorithm, which looks > first for the final letter of the target string, and uses a lookup > table to tell it how far ahead it can skip in the input whenever > it finds a non-matching character. Boyer-Moore is for fixed search strin

Re: why GNU grep is fast

2010-08-21 Thread Steven Hartland
That's a good read for other things as Mike, thanks for taking the time to pass on this knowledge :) - Original Message - From: "Mike Haertel" To: Anyway, just FYI, here's a quick summary of where GNU grep gets its speed. Hopefully you can carry these ideas over to BSD grep. #1 tr

why GNU grep is fast

2010-08-20 Thread Mike Haertel
Hi Gabor, I am the original author of GNU grep. I am also a FreeBSD user, although I live on -stable (and older) and rarely pay attention to -current. However, while searching the -current mailing list for an unrelated reason, I stumbled across some flamage regarding BSD grep vs GNU grep perform