Hi Demi,
On 2020-07-10 22:42, Demi M. Obenour wrote:
On 2020-06-23 22:29, Jordan Geoghegan wrote:
Hello,
I was working on a couple POSIX regular expressions to search for and validate
IPv4 and IPv6 addresses with optional CIDR blocks, and encountered some strange
behaviour from the base system grep.
I wanted to validate my regex against a list of every valid IPv4 address, so I
generated a list with a zsh 1 liner:
for i in {0..255}; do; echo $i.{0..255}.{0..255}.{0..255} ; done | tr
'[:space:]' '\n' > IPv4.txt
My intentions were to test the regex by running it with 'grep -c' to confirm
there was indeed 2^32 addresses matched, and I also wanted to benchmark and
compare performance between BSD grep, GNU grep and ripgrep. The command I used:
grep -Eoc
"((25[0-5]|(2[0-4]|1{0,1}[[:digit:]]){0,1}[[:digit:]])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[[:digit:]]){0,1}[[:digit:]])(/[1-9]|/[1-2][[:digit:]]|/3[0-2])?"
My findings were surprising. Both GNU grep and ripgrep were able get through
the file in roughly 10 and 20 minutes respectively, whereas the base system
grep took over 20 hours! What interested me the most was that the base system
grep when run with '-c' returned '0' for match count. It seems that 'grep -c'
will have its counter overflow if there are more than 2^32-1 matches
(4294967295) and then the counter will start counting from zero again for
further matches.
Does OpenBSD use an NFA/DFA regular expression implementation, or a
backtracking one? If it uses the latter, then your regex is probably
causing catastrophic backtracking.
Regards,
Jordan
Sincerely,
Demi
Ya you're probably right, I know GNU grep does a whole bunch of fancy
optimizations that aren't done in BSD grep, the old GNU grep maintainer
did a small write up on the FreeBSD mailing list about some of their tricks:
https://lists.freebsd.org/pipermail/freebsd-current/2010-August/019310.html
My IPv6 regex causes grep to randomly segfault on MacOS (which uses an
older version of the BSD/FreeGrep that OpenBSD uses I believe), so I
imagine there's something going on under the hood with these regexes
making grep thrash hard.
Regards,
Jordan