Hi,
A pattern is converted to lower case before it is compiled for icase
matching (grep.c:mb_icase_keys). \B and \W don't work correctly because
they are converted to \b and \w.
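To illustrate the failure mode: lowercasing a pattern byte by byte also
lowercases the letter after a backslash, so \B becomes \b. The sketch
below contrasts that with a version that leaves escaped characters alone.
Both function names are hypothetical and are not taken from grep.c, and
the sketch only handles single-byte ASCII patterns.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Naive approach: lowercase every byte, which turns \B into \b. */
void lower_pattern_naive(char *pat)
{
  assert(pat != NULL);
  for (; *pat; pat++)
    *pat = (char) tolower((unsigned char) *pat);
}

/* Escape-aware approach: leave the byte after a backslash unchanged. */
void lower_pattern_keep_escapes(char *pat)
{
  assert(pat != NULL);
  for (; *pat; pat++) {
    if (*pat == '\\' && pat[1] != '\0')
      pat++;                      /* skip the escaped character */
    else
      *pat = (char) tolower((unsigned char) *pat);
  }
}
```

With the input Foo\B, the first function yields foo\b and the second
yields foo\B.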
Hi,
It seems that the problem comes from the included regex.
If you use --without-included-regex, copy the system regex.h to the `lib'
sub-directory and enable search.c:196.
If you use --with-included-regex, copy regex from glibc (2.3 or later)
to the `lib' sub-directory:
% cp glibc-2.3.6/posix/reg* grep-2.
Hi,
When searching text that contains multi-byte characters, grep converts
all of it to wide characters, even the parts of the string that don't
match the pattern as single-byte text. See Bug#14472.
> and that's not done with -P, right? thanks for the response.
Only grep with -P uses the PCRE library, which doesn't understand
multi-byte locales other than UTF-8.
Hi,
Unlike the vi editor (Vim), grep doesn't automatically recognize the
character set of a text. You need to set the locale and character set
through LC_ALL, LANG, etc. Can Cygwin understand UTF-16?
Hi,
> grep 2.5.4 has an undocumented \S operator:
It means it isn't supported by grep 2.5.4.
Grep 2.5.4 uses regex, which is included in GNU libc and supports the
`\S' operator. However, grep 2.5.4 also uses its own engines, which can't
interpret the `\S' operator. So you may not use undocumented ope
> The attached patch are for grep 2.5.1a and 2.5.4.
It doesn't work as it is, so I have made it work. Furthermore, this
patch stops fgrep and egrep from being linked to libpcre, because
fgrep and egrep do not depend on libpcre.
grep-2.5.4.libpcre.patch
Description: Binary data
Hi,
> grep 2.5.4 has an undocumented \S operator:
It means it isn't supported by grep 2.5.4.
Grep 2.5.4 uses regex, which is included in GNU libc and supports the
`\S' operator. However, grep 2.5.4 also uses its own engines, which can't
interpret the `\S' operator. So you may not use undocumented ope
Try using the included regex to disable the \S operator, or apply the
following patch to enable it.
grep-2.5.4.dfa-isspace.patch
Description: Binary data
Hi,
We can't use or escape `]' between `[' and `]' in grep and
egrep. The given cases are interpreted as follows, respectively.
- grep -E "[1-\]]" file_input
[1-\] ] CAT
where [1-\] is range cset.
- grep -E "[1-\\]]" file_input
[1-\\]]CAT
where [1-\\]
Hi,
See the following thread.
http://lists.gnu.org/archive/html/bug-grep/2009-05/msg9.html
>Grep 2.5.4 uses regex, which is included in GNU libc and supports
> `\S' operand. However, Grep 2.5.4 also use own engines, which can't
> interpret `\S' operand. So you mayn't use undocumented op
Hi,
With this patch, `dfaexec' will be called even when the multibyte check
fails for a simple pattern that contains neither a wildcard nor a
repetition expression.
Is that intended?
Hi,
> I'm not happy with removing the null checks in calls to free(); there
> were systems out there that would throw a fatal error if you passed
> null to free(). I'd prefer to leave those checks in.
Though I also thought so at first, in this case it seems to be guaranteed that
elements that is small
Hi,
When a line matches with kwset but fails the is_mb_middle test,
bug#23814 is caused by not checking the rest of the line (grep never
looks for a second match in the line).
In this case, the bug is solved by matching kwset against the rest of
the line.
For a simple pattern which doesn't contain th
Hi,
Much of patch#6899 might be unnecessary, because the performance issue
reported as bug#14472 was improved in the development release.
But I think kwsmb.patch still looks very effective.
Hi,
I think it would be better to correct it as follows. Please point
out if the idea is wrong.
diff -ru grep-2.5.4.183-9159-dirty.orig/src/search.c
grep-2.5.4.183-9159-dirty/src/search.c
--- grep-2.5.4.183-9159-dirty.orig/src/search.c 1970-01-01 00:00:01.0
+
+++ grep-2.5
Hi,
I tried to build grep-2.5.4.183-9159, but I received a warning on
RHEL5. If it has not been corrected yet, I hope you will correct it.
diff -ru grep-2.5.4.183-9159-dirty.orig/src/dfa.c
grep-2.5.4.183-9159-dirty/src/dfa.c
--- grep-2.5.4.183-9159-dirty.orig/src/dfa.c1970-01-01 00:00:0
Hi Paolo,
> > diff1:
> >I seem that It should match at the head of the line when start_ptr
> >isn't set.
>
> Do you have a testcase? I'm hesitant to apply this without one.
No. I had a test case for this patch, but I have lost it...
I need a little time to recreate it.
> > dif
Hi, Jim
This problem is easily reproduced by combining the -w option with a
back-reference.
--
#!/bin/sh
# This would fail for grep-2.6
: ${srcdir=.}
. "$srcdir/init.sh"; path_prepend_ ../src
printf 'foo foo bar\n' > exp1 || framework_failure
fail=0
for LOC in en_US.UTF-8 zh_CN $LOCALE_FR_UTF8; do
out=ou
> > Thank you for the patch.
> > Do either of you have a test case?
>
> No, or I would have included it. But it matches what grep does in
> general to handle this case.
>
> Paolo
Thanks. Though I have no test case, I also think an invalid sequence
regarded as single-byte shouldn't affect follo
Hi,
Thank you for your advice. I have requested assignment for changes.
However it may take some time...
> In future, please consider providing patches in "git format-patch" form,
> so it's less work for us. Here are some guidelines that should help:
> (they're technically for coreutils, but app
Hi Jim,
The regex included up to grep 2.5.4 didn't support RE_ICASE, so we had to
convert the pattern and the input to lower case beforehand for
ignore-case matching. However, in grep 2.6 regex has been updated. It
seems that we no longer need to keep that conversion.
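As a rough illustration of letting the regex engine do the case folding
instead of lowercasing the pattern and input beforehand, the sketch
below uses the POSIX regcomp flag REG_ICASE. This is an assumption made
for illustration: grep itself goes through the GNU regex interface with
the RE_ICASE syntax bit, and icase_match is a hypothetical helper.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Return 1 if TEXT matches PATTERN ignoring case, 0 if it does not,
   and -1 if the pattern fails to compile.  Neither string is
   lowercased or otherwise modified here.  */
int icase_match(const char *pattern, const char *text)
{
  regex_t re;
  int rc;

  assert(pattern != NULL && text != NULL);
  if (regcomp(&re, pattern, REG_ICASE | REG_NOSUB) != 0)
    return -1;
  rc = regexec(&re, text, 0, NULL, 0);
  regfree(&re);
  return rc == 0;
}
```

For example, icase_match("foo", "a FOO b") reports a match without any
copy of the input being made.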
Hi,
I have tested grep-2.6.2. However, the fix for multibyte prefixes
seems insufficient.
Please run the following test case.
--
#!/bin/sh
# This would mistakenly print a line prior to grep-2.6.2.
: ${srcdir=.}
. "$srcdir/init.sh"; path_prepend_ ../src
encode() { echo "$1" | tr ABC '\357\274\2
Hi,
That seems to be the expected behavior. [A-Z] includes A,b,B,c,C,...y,Y,z,Z
in the en_US locale (it does not include `a').
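The point is that a range is evaluated in collation order, not in
code-point order. The sketch below hard-codes the collating sequence
a A b B ... z Z as an assumption modeled on the behavior described
above; real locales define their own order, and both function names are
hypothetical.

```c
#include <assert.h>
#include <ctype.h>

/* Hypothetical collation rank for ASCII letters in the order
   a A b B ... z Z; returns -1 for anything else.  */
int collation_rank(int c)
{
  if (c < 0 || c > 127 || !isalpha(c))
    return -1;
  return 2 * (tolower(c) - 'a') + (isupper(c) ? 1 : 0);
}

/* Return 1 if C collates between LO and HI inclusive. */
int in_collation_range(int lo, int hi, int c)
{
  int r = collation_rank(c);

  assert(collation_rank(lo) >= 0 && collation_rank(hi) >= 0);
  return r >= collation_rank(lo) && r <= collation_rank(hi);
}
```

Under this ordering, in_collation_range('A', 'Z', 'b') is true while
in_collation_range('A', 'Z', 'a') is false, because `a' sorts before `A'.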
Do you assume that int is 32 bits wide?
If CHARCLASS_INTS == 4, we may not be able to compile the following code
correctly.
> + static const charclass utf8_classes[5] = {
> + { 0, 0, 0, 0, ~0, ~0, 0, 0 },/* 80-bf: non-lead bytes
> */
> + { ~0, ~0, ~0, ~0, 0, 0, 0, 0 },
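The quoted initializers list eight words per class, which fits only when
a word holds 32 bits (256 / 32 = 8). One width-independent alternative,
sketched here, is to compute the word count from the word size and set
bits through a helper instead of literal initializers. All names below
(NOTCHAR_SKETCH, CHARCLASS_WORDS, cc_clear, cc_set_range, cc_test) are
hypothetical and are not the dfa.c definitions.

```c
#include <assert.h>
#include <limits.h>
#include <string.h>

#define NOTCHAR_SKETCH (1 << CHAR_BIT)            /* 256 when CHAR_BIT is 8 */
#define WORD_BITS (CHAR_BIT * sizeof(unsigned int))
#define CHARCLASS_WORDS ((NOTCHAR_SKETCH + WORD_BITS - 1) / WORD_BITS)

typedef unsigned int charclass_sketch[CHARCLASS_WORDS];

/* Clear every bit of the class. */
void cc_clear(charclass_sketch cc)
{
  memset(cc, 0, CHARCLASS_WORDS * sizeof cc[0]);
}

/* Set the bits for bytes LO..HI inclusive, whatever the word width. */
void cc_set_range(charclass_sketch cc, int lo, int hi)
{
  assert(0 <= lo && lo <= hi && hi < NOTCHAR_SKETCH);
  for (int b = lo; b <= hi; b++)
    cc[b / WORD_BITS] |= 1u << (b % WORD_BITS);
}

/* Return 1 if byte B is a member of the class. */
int cc_test(const charclass_sketch cc, int b)
{
  return (cc[b / WORD_BITS] >> (b % WORD_BITS)) & 1u;
}
```

The "80-bf: non-lead bytes" class from the quoted patch would then be
built with cc_clear followed by cc_set_range(cc, 0x80, 0xbf).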
Try the following.
$ CPPFLAGS=-I/usr/local/include \
LDFLAGS=-L/usr/local/lib/hpux32 \
./configure --without-libiconv-prefix --without-libintl-prefix
Package: grep
Tags: patch
Case-insensitive matching is expensive in multi-byte locales because the
target text is converted to lower case.
However, it seems that awk, which uses dfa.c just as grep does, doesn't
convert the target text to lower case. It seems that if grep doesn't use
kwset, it also doesn't have
I'm sorry, the content of the attachment was incorrect.
I am sending the correct file.
grep-ignore-icase.txt
Description: Binary data
Package: grep
Tags: patch
When characters matching a regular expression are found by kwsexec or
dfaexec, we need to check whether the match is in the middle of a
multi-byte character. `is_mb_middle' of searchutils.c is used for that.
However, it's expensive, even if most of them consist of sing
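A minimal sketch of such a check, assuming a Shift_JIS-style encoding in
which lead bytes are 0x81-0x9F and 0xE0-0xFC and every multi-byte
character is exactly two bytes. The real is_mb_middle works with the
current locale rather than a hard-coded table; the names here are
hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Lead-byte test for the assumed two-byte encoding. */
static int is_sjis_lead(unsigned char b)
{
  return (b >= 0x81 && b <= 0x9f) || (b >= 0xe0 && b <= 0xfc);
}

/* Return 1 if offset POS falls on the second byte of a two-byte
   character in BUF[0..LEN-1].  The scan starts at the beginning of the
   buffer, which is what makes the check costly.  */
int is_mb_middle_sketch(const unsigned char *buf, size_t len, size_t pos)
{
  size_t i = 0;

  assert(buf != NULL && pos < len);
  while (i < pos)
    i += (is_sjis_lead(buf[i]) && i + 1 < len) ? 2 : 1;
  return i > pos;
}
```

For the bytes 82 61 61, offset 1 is in the middle of a character while
offset 2 is a genuine single-byte `a'.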
Hi Jim,
Thank you for reviewing the patch.
I have no requests for changes to the modified comments and commit
log.
However, can you merge an additional patch, which is attached to this
mail, into the commit? `kwsincr_case' is no longer called for
case-insensitive matching in a mul
Sorry, you are right. The declaration of kwset_exact_matches shouldn't
be removed.
I'm sorry that I didn't test the patch sufficiently.
I fixed several bugs in the patch. In addition to the patch, I attach
the results of the compile and the performance test.
is_mb_middle.txt
Description: Binary data
make.txt
Description: Binary data
test.txt
Description: Binary data
Hi Paul,
Thank you for reviewing the patch.
> Please use something like this instead
All right.
> A minor question about naming: in what sense is mbclen_guess a guess?
Because mbclen_guess always returns -2 for characters of two or more bytes,
I consider that what isn't mbclen_cache should b
Hi Jim,
Thank you for the review, test and fix for the patch.
I have nothing that can be improved after your change.
Norihiro
Package: grep
Tags: patch
In the UTF-8 character set, a letter may have not only upper case and
lower case but also title case. grep-2.16 fails to match in the following
cases because it does not take title case into consideration.
echo 'LJ' | LC_ALL=en_US.UTF-8 grep -i Lj
echo 'Lj' | LC_ALL=en_US.UTF-8 grep -i LJ
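A minimal sketch of a comparison that treats the upper, title and lower
forms of one letter as equivalent by folding all three to the lowercase
form. The table is hard-coded for ASCII and the LJ digraph (U+01C7,
U+01C8, U+01C9) only, as an assumption for illustration; it is not how
grep or the patch implements case folding, and both names are
hypothetical.

```c
#include <assert.h>

/* Fold a code point to a canonical lowercase form.  Only ASCII and the
   LJ digraph (U+01C7 upper, U+01C8 title, U+01C9 lower) are handled. */
unsigned long fold_sketch(unsigned long cp)
{
  assert(cp <= 0x10FFFFul);
  if (cp >= 'A' && cp <= 'Z')
    return cp - 'A' + 'a';
  if (cp == 0x01C7 || cp == 0x01C8)
    return 0x01C9;
  return cp;
}

/* Return 1 if A and B are the same letter ignoring case. */
int equal_ignoring_case(unsigned long a, unsigned long b)
{
  return fold_sketch(a) == fold_sketch(b);
}
```

With this folding, the upper case form U+01C7 and the title case form
U+01C8 compare equal in both directions.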
Sorry, the patch I attached was wrong.
I have corrected it.
case-fold-title-case.txt
Description: Binary data
Paul Eggert wrote:
> 1. It doesn't solve the problem from the ordinary user's point of view.
> For example, "echo lj | LC_ALL=en_US.UTF-8 src/grep -i ?" will still
> output nothing, because the one-character pattern "?" does not match
> the two-character string "lj" even when the latter's two-lette
Hi Jim,
Sorry for the trouble. When I submit future patches, I will create them
with "git format-patch --stdout -1".
Hi,
The slowdown may be caused by the patch, because MBCSET is processed not
by the DFA engine but by the regexp engine.
I tested performance on grep-2.17 and on a version with the patch reverted.
The latter is 100x faster.
yes $(printf '%078dm' 0)|head -1 > in
grep-2.17 original:
$ for i in $(seq 10); do
Package: grep
Tags: patch
In recent years, the grep matcher has become very fast thanks to
improvements to the dfa engine. On the other hand, the fgrep matcher uses
only the kwset engine, which generally isn't very good at
case-insensitive matching.
The patch makes it possible to switch case-insensitive matching with the
fgrep matcher
in
Hi Jim,
Your patch is probably right.
However, I think the true cause of the 100x slowdown is that the DFA
engine is slower than the regex engine for case-insensitive matching in a
non-UTF-8 locale.
In a multibyte locale, grep prefers the DFA engine for case-insensitive
"a", but for the character class "[Aa]" it pref
In the following case, it is about 200-400x faster, which equals the
performance of grep.
Patch#16232 may also work effectively.
- Before the patch
$ yes $(printf '%078dm' 0)| head -100 | tr 0 a > in
$ for i in 1 2 3 4 5; do env LC_ALL=ja_JP.UTF-8 time src/fgrep -i 'a' in; done
Command exited with non-zer
Package: grep
Tags: patch
The patch is the DFA version of patch#16544 "Optimization for is_mb_middle".
It will improve performance for non-UTF8 locales in the DFA engine.
I tested as below. In both cases, the speed-up is 3-3.5x.
$ yes $(printf '%078dm' 0)|head -100 > in
$ for i in `seq 5`; do env LC_ALL=ja_JP
Hi Jim,
Thank you for your review and for pointing out the bug in the patch. You
are right. I wrote the wrong if conditions. I think the patch shouldn't
change the behavior of the pcre or fgrep matcher. I have fixed the bug
and am re-sending the patch and the test results.
Norihiro
av
I used the attachment on this mail to test for "removal of
trivial_case_ignore".
Norihiro
removal_of_trivial_case_ignore.txt
Description: Binary data
Package: grep
Tags: patch
I overlooked an important point about the optimization by
trivial_case_ignore. After optimization by trivial_case_ignore, the
kwset engine can still be used. However, if trivial_case_ignore is
removed, it is no longer used at all, because kwsmusts does nothing when
MB_CUR_MAX > 1
&&
Package: grep
Tags: patch
I found a difference between the dfa and regex (glibc) treatment of
titlecase. In case-insensitive matching in a UTF-8 locale, U+01C7 (LATIN
CAPITAL LETTER LJ) matches U+01C8 (LATIN CAPITAL LETTER L WITH SMALL
LETTER J) with regex, but it doesn't with dfa.
The patch fixes mismat
Hi Paul
Thank you for checking the patch.
> First, why does the first patch add those four using_utf8 calls to
> parse_bracket_exp? Isn't that optimization valid regardless of
> whether the multibyte encoding is UTF-8?
The optimization in which MBCSET is changed into CSET in addtok is completed
on
I have made several modifications to the patch.
First, I fixed the bug for titlecase.
Second, I changed it so that replacing OR with CSET is preferred, in order
to reduce the number of states.
Third, I modified comments in the source code and put drafts of commit
messages in the patch.
Norihiro
patc
Package: grep
Tags: patch
The patch avoids adding the same character twice to a bracket expression
in trivial_case_ignore. That may generate fewer tokens in multibyte
locales.
For example, FULLWIDTH LATIN CAPITAL LETTER A (ef bd 81) will be
transformed
as below, because multibyte characters in CSE
Paul Eggert wrote:
> On second thought, I may have been too strict here. I suppose one
> could interpret POSIX to say that since 'σ' == tolower (toupper ('?')),
> that it should be OK for the pattern 'σ' to match the string '?' when
> ignoring case, even though the characters differ and are both l
Paul Eggert wrote:
> IIRC it's because a CSET matches any byte, while the corresponding
> MBCSET only matches that byte if it is a single-byte character.
> So for example, say "\x82\x61" is a two-byte character. The CSET "A"
> will match it but the corresponding MBCSET will not.
>
> This can happ
Paolo Bonzini wrote:
> What about these two commands:
>
> grep [a]
> grep -i A
>
> Would they match \x82\x61 ("B", U+0FF22) with your patch? And without it?
No match for all.
--
Before the patch:
$ locale -a | grep sjis
ja_JP.sjis
$ printf "\x82\x61\n" | env LC_ALL=ja_JP.sjis src/gre
Hi Paul,
Thanks for the thorough investigation. I now understand that we cannot
generally tell whether DFA's or regex's behavior is right.
I have tested the behavior of several regex engines. What's interesting
is that most of the results differ from one another. And nobody will
understand which is right.
Norihiro Tanaka wrote:
> And nobody will understand which is right.
However, I still believe that the upper or lower case of a character
should also match title case, because I think that title case is an
extension of the cases (such as upper or lower), and furthermore they
also match title
case (tho
Paolo Bonzini wrote:
> Right, it's handled by SKIP_REMAINS_MB_IF_INITIAL_STATE.
Yes. It's handled by SKIP_REMAINS_MB_IF_INITIAL_STATE, so no problem.
Norihiro
Package: grep
Tags: patch
With this patch, DFA may build a superset of itself, which is the same as
the original except that ANYCHAR, MBCSET and BACKREF are replaced by a
CSET with all bits set followed by STAR, and mb_cur_max is equal to 1.
For example, if given the pattern `a\(b\)c\1', the tokens of original
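The replacement rule can be sketched on a flat token list. This is a
simplification under stated assumptions: the token enum and
superset_tokens are hypothetical, and dfa.c's real token representation
(with its concatenation and grouping operators) is not reproduced here.

```c
#include <assert.h>
#include <stddef.h>

enum tok_sketch {
  T_CHAR, T_CSET, T_ANYCHAR, T_MBCSET, T_BACKREF, T_CSET_FULL, T_STAR
};

/* Write the superset of IN[0..N-1] to OUT and return its length.
   OUT must have room for 2 * N tokens.  */
size_t superset_tokens(const enum tok_sketch *in, size_t n,
                       enum tok_sketch *out)
{
  size_t k = 0;

  assert(in != NULL && out != NULL);
  for (size_t i = 0; i < n; i++) {
    switch (in[i]) {
    case T_ANYCHAR:
    case T_MBCSET:
    case T_BACKREF:
      out[k++] = T_CSET_FULL;   /* a CSET with all bits set ... */
      out[k++] = T_STAR;        /* ... repeated zero or more times */
      break;
    default:
      out[k++] = in[i];         /* everything else is kept as is */
      break;
    }
  }
  return k;
}
```

Because the superset accepts at least everything the original accepts, a
line rejected by it can be skipped without running the exact matcher.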
I fixed the bug in which dfasuperset() doesn't handle QMARK and PLUS, and
modified several comments.
patch.txt
Description: Binary data
I have updated the patch and added a draft of the commit log.
Norihiro
patch.txt
Description: Binary data
Sorry, the patch still had bugs. I fixed them. I confirmed that the
patched version passed all regression tests.
patch.txt
Description: Binary data
Package: grep
Tags: patch
The Boyer-Moore algorithm runs in O(m n) in the worst case,
which may be much slower than the DFA.
The Galil rule reduces O(m n) to O(n) for that case, without overhead or
slow-down for other cases, by avoiding comparing more
than once for a po
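A minimal sketch of the Galil rule, under simplifying assumptions: after
an occurrence is found, the window shifts by the pattern's smallest
period and the overlapping prefix is remembered as already matched, so
it is not compared again. For brevity the mismatch case uses a
Horspool-style bad-character shift rather than the delta2 shift, and
none of the names come from kwset.c.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Smallest p such that pat[i] == pat[i + p] for every valid i. */
static size_t smallest_period(const unsigned char *pat, size_t m)
{
  for (size_t p = 1; p < m; p++) {
    size_t i = 0;
    while (i + p < m && pat[i] == pat[i + p])
      i++;
    if (i + p == m)
      return p;
  }
  return m;
}

/* Count occurrences of PAT[0..M-1] in TEXT[0..N-1] and store the
   number of byte comparisons performed in *CMPS.  */
size_t bm_galil_count(const unsigned char *text, size_t n,
                      const unsigned char *pat, size_t m, size_t *cmps)
{
  size_t shift[UCHAR_MAX + 1];
  size_t matches = 0, known = 0, s = 0, period;

  assert(text != NULL && pat != NULL && cmps != NULL && m > 0);
  *cmps = 0;
  for (size_t c = 0; c <= UCHAR_MAX; c++)
    shift[c] = m;
  for (size_t i = 0; i + 1 < m; i++)
    shift[pat[i]] = m - 1 - i;
  period = smallest_period(pat, m);

  while (s + m <= n) {
    size_t j = m;
    /* Compare right to left, but never below the KNOWN prefix. */
    while (j > known) {
      ++*cmps;
      if (text[s + j - 1] != pat[j - 1])
        break;
      j--;
    }
    if (j == known) {           /* occurrence found at offset s */
      matches++;
      s += period;
      known = m - period;       /* this prefix is known to match */
    } else {
      s += shift[text[s + m - 1]];
      known = 0;
    }
  }
  return matches;
}
```

Searching for "aaaa" in a run of n `a' bytes then costs n comparisons in
total, instead of roughly 4 per alignment.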
I changed the patch so that the delta2 shift is extracted from the trie,
because it's very excellent.
Norihiro
>From 932e0774428e9b5015c9de31b8a509a5d01c4abe Mon Sep 17 00:00:00 2001
From: Norihiro Tanaka
Date: Sat, 15 Mar 2014 14:41:52 +0900
Subject: [PATCH] grep: optimization by u
m.
Norihiro
>From 180ad10aa80c22b3ca67ff7201cf578a594f6de9 Mon Sep 17 00:00:00 2001
From: Norihiro Tanaka
Date: Sun, 16 Mar 2014 09:33:25 +0900
Subject: [PATCH] grep: removal of trivial_case_ignore
When change kwsmusts as it's used even if fill MB_CUR_MAX > 1 and
case-insensitive, DFA give
by line. However, the whole buffer is passed to re_search and
re_match. That seems wrong to me.
Norihiro
>From 7187092186b982b95e94df81393e8fa72060985c Mon Sep 17 00:00:00 2001
From: Norihiro Tanaka
Date: Mon, 17 Mar 2014 23:46:31 +0900
Subject: [PATCH] grep: matching line-by-line w
real 1.21
user 0.71
sys 0.46
Norihiro
>From d69cf4d289034a21067a6e0a7495921df0a2aac9 Mon Sep 17 00:00:00 2001
From: Norihiro Tanaka
Date: Mon, 17 Mar 2014 20:41:25 +0900
Subject: [PATCH] grep: prefer regex to DFA for ANYCHAR in multi-byte locales
* src/dfa.c (dfaexec): prefer regex to DFA for ANYCH
character fixed string from tokens.
Norihiro
>From 7a67844524c0657fc395966536805d9736c0a88e Mon Sep 17 00:00:00 2001
From: Norihiro Tanaka
Date: Tue, 18 Mar 2014 21:01:47 +0900
Subject: [PATCH] grep: open CSET and transform into the upper case when
MB_CUR_MAX == 1 in dfamust
In MB_CUR_MAX
on Sep 17 00:00:00 2001
From: Norihiro Tanaka
Date: Sat, 22 Mar 2014 15:11:52 +0900
Subject: [PATCH] grep: avoid to re-build a state built previously.
* src/dfa.c (dfaexec): avoid to re-build a state built previously.
---
src/dfa.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git
>From e56992c4bfdb2e02a114b14c34780672a9c8cee9 Mon Sep 17 00:00:00 2001
From: Norihiro Tanaka
Date: Sun, 23 Mar 2014 20:14:33 +0900
Subject: [PATCH] grep: optimization of DFA by reuse of multi-byte buffers in
non-UTF8 locales
* src/dfa.c (struct dfa): New members `mblen_buf', `nmblen_buf',
`inputwcs', `
Package: grep
Tags: patch
`]' should also be treated as a special character in fgrep_to_grep_pattern.
Norihiro
>From 47e891d0c66259c506db466f830bdf963037999a Mon Sep 17 00:00:00 2001
From: Norihiro Tanaka
Date: Mon, 24 Mar 2014 22:58:21 +0900
Subject: [PATCH] grep: addition of ]
Hi Paul,
Sorry, I was wrong. `]' isn't a special character without `['.
On Mon, 24 Mar 2014 08:13:17 -0700
Paul Eggert wrote:
> Norihiro Tanaka wrote:
> > `]' should also take into special characters in fgrep_to_grep_pattern.
>
> Sorry, I'm not se
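To make the point about `]' concrete, here is a minimal sketch of turning
a fixed string into a basic regular expression by backslash-escaping
special characters. The escaped set "$*.[\^" is an assumption for
illustration and escape_fixed_sketch is a hypothetical helper, not the
fgrep_to_grep_pattern code. `]' is left alone because it is only special
after an unescaped `['.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy KEY to OUT, putting a backslash before each BRE special
   character.  OUT must have room for 2 * strlen (KEY) + 1 bytes.  */
void escape_fixed_sketch(const char *key, char *out)
{
  assert(key != NULL && out != NULL);
  for (; *key; key++) {
    if (strchr("$*.[\\^", *key) != NULL)
      *out++ = '\\';
    *out++ = *key;
  }
  *out = '\0';
}
```

The fixed string a.b*[c] becomes a\.b\*\[c] with the closing bracket
unescaped.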
Hi Senthil,
(off the bug tracker, list only)
I don't think it's a bug in grep.
Is what you want to do `grep -r "Mr\.*" f1.dat*' and
`grep -r "Mrs\.*" f1.dat*'?
Norihiro
On Mon, 24 Mar 2014 19:10:53 +0530
Senthil Kumar wrote:
> Dear Person,
>
> I have
r.*" f1.dat*) is getting retrieved all
> names having both
>
> *Mr. and Mrs.*
> please do let me know is this defect ? Hoping to get an revert mail on this.
>
> --
>
> * Thanks & Regards,KK Senthil Kumar*
--
田中 紀洋 (Norihiro TANAKA)
E-mail : nori...@kcn.ne.jp
Package: grep
Tags: patch
When the multibyte character check fails after an exact match in KWSet,
I think we can advance the `beg' pointer before running DFA, because
a match will never be found at a position earlier than the failed one in
the text.
Norihiro
Sorry, I failed to attach the patch properly. I am resending it.
>From 1bf0ddb362595652fd40008eb4da50f17e1f1358 Mon Sep 17 00:00:00 2001
From: Norihiro Tanaka
Date: Wed, 26 Mar 2014 00:41:48 +0900
Subject: grep: proceed the `beg' pointer after exact matched in KWSet
* src/dfasearch.c (E
The patch I sent previously had a bug. It's necessary to run DFA
in a narrower range without moving the `beg' pointer. The bug is fixed
in this patch.
Sorry for the repeated failure.
I re-send it.
>From 71158b2fa799379bdcdb6f5ac1b93f78566fbd44 Mon Sep 17 00:00:00 2001
From: Norihiro Tanaka
Date: Wed, 26 Mar 2014 00:41:48 +0900
Subject: [PATCH] grep: running DFA in more narrow range after failure in
exact match
* src/dfasearch.c (EGexec
The reversed-range-endpoints test fails after egrep and fgrep went back
to being shell scripts. It seems that the program name is not removed
correctly.
Norihiro
>From f937bbb04826b0fb36aaeb96d95e0ac2a7ac3e33 Mon Sep 17 00:00:00 2001
From: Norihiro Tanaka
Date: Wed, 26 Mar 2014 23:06:30 +0900
Eric Blake wrote:
> Your patch is once again illegible.
Sorry, I resent it.
Norihiro
Jim,
Thanks, I have added the comments to the patch and have slightly
modified the comment you wrote.
Norihiro
From a5540fa9f5e5b9339afe59b3d8e1b3b4791397e4 Mon Sep 17 00:00:00 2001
From: Norihiro Tanaka
Date: Thu, 27 Mar 2014 21:34:42 +0900
Subject: [PATCH] grep: perform the kwset-helping DFA
e into new member of struct dfa.
When more than one struct dfa is used at the same time, their mbrtowc
caches may conflict. So make mbrtowc_cache a new member of struct dfa,
giving each instance its own mbrtowc cache.
Norihiro
From 41bfd2f66a48efc0cdf1b865c2cc4cdb48d98ce0 Mon Sep 17 00:00:00
I rebased this patch and added a bug fix to it.
If `elems' of `follows' is re-allocated in transit_state(), it may cause
a segfault. So I changed the code so that it doesn't copy d->mb_follows
to the `follows' variable.
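The hazard can be sketched with a small growable set. A by-value copy of
the struct keeps the old `elems' pointer, which becomes dangling once the
array is re-allocated, so every access below goes through a pointer to
the owning struct. The type and function names are hypothetical and are
not the dfa.c definitions.

```c
#include <assert.h>
#include <stdlib.h>

struct pos_set_sketch {
  int *elems;
  size_t nelem, alloc;
};

/* Append X, growing the array if needed.  Returns 0 on success. */
int pos_set_push(struct pos_set_sketch *s, int x)
{
  assert(s != NULL);
  if (s->nelem == s->alloc) {
    size_t n = s->alloc ? 2 * s->alloc : 4;
    int *p = realloc(s->elems, n * sizeof *p);
    if (p == NULL)
      return -1;
    s->elems = p;   /* a by-value copy of *s would still hold the old pointer */
    s->alloc = n;
  }
  s->elems[s->nelem++] = x;
  return 0;
}

/* Read through the owning struct so the current array is always used. */
int pos_set_last(const struct pos_set_sketch *s)
{
  assert(s != NULL && s->nelem > 0);
  return s->elems[s->nelem - 1];
}
```

A caller that keeps a struct copy taken before a push may read freed
memory afterwards, which matches the kind of segfault described above.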
From 92abd82f0d1d42da7c68a3bb3d2d6079073120ae Mon Sep 17 00:00:00 200
iginal dfa. (dfasuperset)
3. Change the return type of dfahint() so that the caller can check
whether it was used. (dfahint)
4. If neither kwset nor dfahint() is used, still run the DFA matcher over
the whole range.
Norihiro
From 17f5934d50b121ef3f7c98b0b0db3ae8c891b8d4 Mon Sep 17 00:00:00 2001
Jim,
Thanks, I checked that it acts as expected.
Norihiro
If line matching fails at a position found by kwsexec(), the line can
never match in line matching, so the line is skipped.
Norihiro
From b8f24ddeb7ddf211a4dce662734ef4387d48b4c2 Mon Sep 17 00:00:00 2001
From: Norihiro Tanaka
Date: Sun, 30 Mar 2014 21:03:58 +0900
Subject: [PATCH] grep: speed
Hi Paolo,
I wrote the patch to speed up the Boyer-Moore algorithm in KWSet in
bug#17013. As the next step, I want to be able to use it for
case-insensitive matching, too. Furthermore, in patch#17034 I wrote a
patch that lets the Boyer-Moore algorithm be used for CSET when the
case_fold flag is set.
However,
Hi Paolo,
I applied the same type and naming to the member `backref' of dfastate,
and I checked that it passes the regression tests.
Thanks,
Norihiro
From 7cbf75fd2e8156f20e34d1d163fe28d6fc1306f1 Mon Sep 17 00:00:00 2001
From: Norihiro Tanaka
Date: Tue, 1 Apr 2014 23:48:16 +0900
Subject: [PATCH] grep: p
Hi Paolo,
> For ANYCHAR, you can convert it to CSET{1,mb_cur_max} or, even better,
> (single-CSET | lead-CSET full-CSET{0,mb_cur_max-1}).
That seems complicated to me. The superset requires a memory area that
is different from the original DFA and additional costs to build it. And
exact matchi
Hi Paolo,
> I'm worried that the "STAR" method will match basically everything.
If no normal char and/or CSET is included in the pattern, the superset
won't be used.
> We're using something like CSET{1,mb_cur_max} already for UTF-8, so the size
> increase for that should not be too bad.
We can
Paolo Bonzini wrote:
> Yeah, but my problem is that a.b will look at a very long line if it
> is translated to a[\x0-\xff]*b. Better translate it to a[\x0-\xff]{1,2}b
> or something similar.
I think that is not a problem.
For example, I tried the following text for the pattern `a.b'. Whereas the
digit
Norihiro Tanaka wrote:
> For example, I try following text for the pattern `a.b'.
In UTF8, the pattern `a.b' doesn't use the superset. Consider `a[d-z]b'
and/or `\(a\)\1b' instead of it.
Norihiro
Paolo Bonzini wrote:
> Better translate it to a[\x0-\xff]{1,2}b or something similar.
I also thought so previously. However, since we don't require an exact
match from the superset, I believe that would be meaningless.
Norihiro
Paolo Bonzini wrote:
> Does anything change if there are a few million c's?
The superset of `a ANYCHAR b' is 'a CSET STAR b'.
Its DFA states are as follows.
s0: The position set is none.
s1: The position set is 1:a
s2: The position set is 1:a 2:CSET
s3: The position set is 1:a 2:CSET 3:b (accep
Norihiro Tanaka wrote:
> s0: The position set is none.
> s1: The position set is 1:a
> s2: The position set is 1:a 2:CSET
> s3: The position set is 1:a 2:CSET 3:b (accepted)
Sorry, it was wrong. It should be as follows.
s0: The position set is none.
s1: The position set is
I changed the type of `has_backref' into `bool'.
Norihiro
From 11bf4318c360c29a3000afee8ee9f41ec431130e Mon Sep 17 00:00:00 2001
From: Norihiro Tanaka
Date: Tue, 1 Apr 2014 23:48:16 +0900
Subject: [PATCH] grep: prefer regex to DFA for ANYCHAR in multi-byte locales
* src/dfa.c (dfaexe
In the second patch, I changed things so that the Boyer-Moore algorithm
can also be used for case-insensitive matching if MB_CUR_MAX == 1. It
works with patch#17019 and patch#17034.
From 25f72238cdda4f3372aaa9181075f975832ef50f Mon Sep 17 00:00:00 2001
From: Norihiro Tanaka
Date: Sat, 15 Mar 2014 14:41:52
I fixed the bug in the patch and added a call to resetmust().
From ac54299352bf5feb5cb7a5f24f49c4d019dcc23b Mon Sep 17 00:00:00 2001
From: Norihiro Tanaka
Date: Tue, 18 Mar 2014 21:01:47 +0900
Subject: [PATCH] grep: open CSET and transform into the upper case when
MB_CUR_MAX == 1 in dfamust
* src
We need to initialize the new member.
I have added that to the patch.
From 11bf4318c360c29a3000afee8ee9f41ec431130e Mon Sep 17 00:00:00 2001
From: Norihiro Tanaka
Date: Tue, 1 Apr 2014 23:48:16 +0900
Subject: [PATCH] grep: prefer regex to DFA for ANYCHAR in multi-byte locales
* src/dfa.c (dfaexec): prefer
Norihiro Tanaka wrote:
> s0: The position set is none.
> s1: The position set is 1:a
> s2: The position set is 1:a 2:CSET
> s3: The position set is 2:CSET 3:b (accepted)
> s4: The position set is 2:CSET
Sorry, it was still wrong. It should be as follows.
s0: The position set i