Hi Demi,

On 2020-07-10 22:42, Demi M. Obenour wrote:
On 2020-06-23 22:29, Jordan Geoghegan wrote:
Hello,

I was working on a couple POSIX regular expressions to search for and validate 
IPv4 and IPv6 addresses with optional CIDR blocks, and encountered some strange 
behaviour from the base system grep.

I wanted to validate my regex against a list of every valid IPv4 address, so I 
generated a list with a zsh 1 liner:

      for i in {0..255}; do; echo $i.{0..255}.{0..255}.{0..255} ; done | tr 
'[:space:]' '\n' > IPv4.txt

My intentions were to test the regex by running it with 'grep -c' to confirm 
there was indeed 2^32 addresses matched, and I also wanted to benchmark and 
compare performance between BSD grep, GNU grep and ripgrep. The command I used:

    grep -Eoc 
"((25[0-5]|(2[0-4]|1{0,1}[[:digit:]]){0,1}[[:digit:]])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[[:digit:]]){0,1}[[:digit:]])(/[1-9]|/[1-2][[:digit:]]|/3[0-2])?"

My findings were surprising. Both GNU grep and ripgrep were able get through 
the file in roughly 10 and 20 minutes respectively, whereas the base system 
grep took over 20 hours! What interested me the most was that the base system 
grep when run with '-c' returned '0' for match count. It seems that 'grep -c' 
will have its counter overflow if there are more than 2^32-1 matches 
(4294967295) and then the counter will start counting from zero again for 
further matches.
Does OpenBSD use an NFA/DFA regular expression implementation, or a
backtracking one?  If it uses the latter, then your regex is probably
causing catastrophic backtracking.

Regards,

Jordan
Sincerely,

Demi


Ya you're probably right, I know GNU grep does a whole bunch of fancy optimizations that aren't done in BSD grep, the old GNU grep maintainer did a small write up on the FreeBSD mailing list about some of their tricks:
https://lists.freebsd.org/pipermail/freebsd-current/2010-August/019310.html

My IPv6 regex causes grep to randomly segfault on MacOS (which uses an older version of the BSD/FreeGrep that OpenBSD uses I believe), so I imagine there's something going on under the hood with these regexes making grep thrash hard.


Regards,

Jordan

Reply via email to