Once grep -P has found its first match, the TEXTBIN_UNKNOWN optimizations
are no longer used.  So if grep -P finds a match early, it is very slow
in UTF-8 for the rest of the input.

$ time -p grep -P ^1$ <(seq 999999)
1
real 14.55
user 13.77
sys 1.12

grep -Pa does not use the TEXTBIN_UNKNOWN optimizations either, so it is
also very slow in UTF-8.

$ time -p grep -Pa a <(seq 999999)
real 14.53
user 13.65
sys 1.35

This change keeps the TEXTBIN_UNKNOWN optimizations in effect until
grep -P finds a binary character.  It gives more than a 10x speedup.

$ time -p src/grep -P ^1$ <(seq 999999)
1
real 0.97
user 0.79
sys 0.24
$ time -p src/grep -Pa a <(seq 999999)
real 0.98
user 0.23
sys 0.99

BTW, this change conflicts with the proposal in bug#22028.
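
For readers who want the new rule in isolation, here is a rough,
standalone model of the decision the patch makes after scanning a buffer
while the file type is still unknown.  It is only a sketch: the enum and
function names are made up for illustration and are not the identifiers
in src/grep.c, and nlines is assumed to be the number of lines already
matched.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the --binary-files modes in src/grep.c.  */
enum mode_model { MODE_BINARY, MODE_TEXT, MODE_WITHOUT_MATCH };

/* What the model decides after scanning one buffer while the file
   type is still unknown.  */
enum decision_model
{
  KEEP_UNKNOWN,   /* buffer looks like text: keep the fast path */
  TREAT_AS_TEXT,  /* binary data seen, but lines are still printed */
  BINARY_QUIET,   /* binary file: stop at first match, print no lines */
  SKIP_FILE       /* without-match mode: give up on the file */
};

/* Simplified model of the rule in the patch.  NLINES is assumed to be
   the count of lines matched so far; MODE models --binary-files.  */
static enum decision_model
decide_after_scan (bool buffer_is_binary, long nlines, enum mode_model mode)
{
  assert (nlines >= 0);

  /* No binary data yet: stay in the unknown state.  */
  if (!buffer_is_binary)
    return KEEP_UNKNOWN;

  /* Output has already started, or -a was given: fall back to text.  */
  if (nlines || mode == MODE_TEXT)
    return TREAT_AS_TEXT;

  if (mode == MODE_WITHOUT_MATCH)
    return SKIP_FILE;

  return BINARY_QUIET;
}
```

In this model, decide_after_scan (false, 1, MODE_BINARY) stays in
KEEP_UNKNOWN, which is the early-match case from the first timing above:
a match alone no longer forces the slow path.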

From 2cf98594e1b7ce7490d0b6d7551f52d65ccd44a4 Mon Sep 17 00:00:00 2001
From: Norihiro Tanaka <nori...@kcn.ne.jp>
Date: Thu, 26 Nov 2015 15:34:13 +0900
Subject: [PATCH] grep: improve performance for grep -P in UTF-8

grep -P uses line by line search after found first match or specified
-a option, but it is very slow.  This change also tries to use
multi-line search after them until found not text character.

* src/grep.c (grep): Do it.
* NEWS: Mention it.
---
 NEWS       |  6 ++++++
 src/grep.c | 28 ++++++++++++++--------------
 2 files changed, 20 insertions(+), 14 deletions(-)

diff --git a/NEWS b/NEWS
index ac632d7..a9a7042 100644
--- a/NEWS
+++ b/NEWS
@@ -2,6 +2,12 @@ GNU grep NEWS                                    -*- outline -*-
 
 * Noteworthy changes in release ?.? (????-??-??) [?]
 
+** Improvements
+
+  Performance has improved for grep -P in UTF-8.  Before, commands
+  like the following would speed up more than 10x:
+    grep -P ^1$ <(seq 999999)
+    grep -aP a <(seq 999999)
 
 * Noteworthy changes in release 2.22 (2015-11-01) [stable]
 
diff --git a/src/grep.c b/src/grep.c
index 2c5e09a..a1ee183 100644
--- a/src/grep.c
+++ b/src/grep.c
@@ -1345,7 +1345,7 @@ grep (int fd, struct stat const *st)
       return 0;
     }
 
-  if (binary_files == TEXT_BINARY_FILES)
+  if (binary_files == TEXT_BINARY_FILES && execute != Pexecute)
     textbin = TEXTBIN_TEXT;
   else
     {
@@ -1415,13 +1415,8 @@ grep (int fd, struct stat const *st)
         }
 
       /* Detect whether leading context is adjacent to previous output.  */
-      if (lastout)
-        {
-          if (textbin == TEXTBIN_UNKNOWN)
-            textbin = TEXTBIN_TEXT;
-          if (beg != lastout)
-            lastout = 0;
-        }
+      if (beg != lastout)
+        lastout = NULL;
 
       /* Handle some details and read more data to scan.  */
       save = residue + lim - beg;
@@ -1442,12 +1437,17 @@ grep (int fd, struct stat const *st)
           enum textbin tb = buffer_textbin (bufbeg, buflim - bufbeg);
           if (textbin_is_binary (tb))
             {
-              if (binary_files == WITHOUT_MATCH_BINARY_FILES)
-                return 0;
-              textbin = tb;
-              done_on_match = out_quiet = true;
-              nul_zapper = eol;
-              skip_nuls = skip_empty_lines;
+              if (nlines || binary_files == TEXT_BINARY_FILES)
+                textbin = TEXTBIN_TEXT;
+              else
+                {
+                  if (binary_files == WITHOUT_MATCH_BINARY_FILES)
+                    return 0;
+                  textbin = tb;
+                  done_on_match = out_quiet = true;
+                  nul_zapper = eol;
+                  skip_nuls = skip_empty_lines;
+                }
             }
         }
     }
-- 
2.4.6