Re: Bug#387704: grep: -i breaks \W in some locales (perhaps UTF-8 locales only)

2009-04-15 Thread Norihiro Tanaka
Hi, With case-insensitive matching, the pattern is converted to lower case before it is compiled (grep.c:mb_icase_keys). \B and \W don't work correctly because they are converted to \b and \w respectively.
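
The effect described above can be reproduced in a few lines. This is a minimal Python sketch, not the code in grep.c; `lower_pattern_naive` and `lower_pattern_keep_escapes` are hypothetical names introduced only for illustration.

```python
def lower_pattern_naive(pattern):
    """Lowercase every character, including the letter after a backslash."""
    return pattern.lower()


def lower_pattern_keep_escapes(pattern):
    """Lowercase literal characters but copy backslash escapes unchanged."""
    out = []
    i = 0
    while i < len(pattern):
        if pattern[i] == "\\" and i + 1 < len(pattern):
            out.append(pattern[i:i + 2])  # keep e.g. \W or \B as written
            i += 2
        else:
            out.append(pattern[i].lower())
            i += 1
    return "".join(out)
```

With the naive version, `Foo\W` becomes `foo\w`, which changes the meaning of the pattern; the escape-aware version leaves `\W` alone.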

Re: bug? --color/--only-matching when !MBS_SUPPORT or MB_CUR_MAX==1

2009-04-20 Thread Norihiro Tanaka
Hi, It seems that the problem comes from the included regex. If you use --without-included-regex, copy the system regex.h to the `lib' sub-directory and enable search.c:196. If you use --with-included-regex, copy regex from glibc (2.3 or later) to the `lib' sub-directory. % cp glibc-2.3.6/posix/reg* grep-2.

Re: why is grep so slow?

2009-04-25 Thread Norihiro Tanaka
Hi, When searching text that contains multi-byte characters, grep converts all of them to wide characters, even the parts of the string that don't match the pattern as single bytes. See Bug#14472.

Re: why is grep so slow?

2009-04-25 Thread Norihiro Tanaka
> and that's not done with -P, right? thanks for the response. Only grep with -P uses the PCRE library, which doesn't understand multi-byte locales other than UTF-8.

Re: grep for unicode text files?

2009-04-27 Thread Norihiro Tanaka
Hi, Unlike the vi editor (Vim), grep doesn't automatically recognize the character set of a text. You need to set the locale and character set via LC_ALL, LANG, etc. Can Cygwin understand UTF-16?

Re: undocumented \S "joyride" operator

2009-05-15 Thread Norihiro Tanaka
Hi, > grep 2.5.4 has an undocumented \S operator: That means it isn't supported by grep 2.5.4. Grep 2.5.4 uses regex, which is included in GNU libc and supports the `\S' operand. However, grep 2.5.4 also uses its own engines, which can't interpret the `\S' operand. So you may not use undocumented ope

Re: [PATCH] pcre: replace pcre-config with pkg-config

2009-05-15 Thread Norihiro Tanaka
> The attached patch are for grep 2.5.1a and 2.5.4. It doesn't work as it is. I have made it work. Furthermore, this patch changes the build so that fgrep and egrep are not linked to libpcre, because they have no dependency on it. grep-2.5.4.libpcre.patch Description: Binary data

Re: undocumented \S "joyride" operator

2009-05-18 Thread Norihiro Tanaka
Try using the included regex to disable the \S operand, or apply the following patch to enable it. grep-2.5.4.dfa-isspace.patch Description: Binary data

Re: How to use ']' in the upper bound of a range character set

2009-05-21 Thread Norihiro Tanaka
Hi, We can't use and/or escape `]' between `[' and `]' in grep and egrep. Given cases is interpreted respectively as follows. - grep -E "[1-\]]" file_input [1-\] ] CAT where [1-\] is range cset. - grep -E "[1-\\]]" file_input [1-\\]]CAT where [1-\\]

Re: grep interprets \s in a confusing way

2009-06-03 Thread Norihiro Tanaka
Hi, See following thread. http://lists.gnu.org/archive/html/bug-grep/2009-05/msg9.html >Grep 2.5.4 uses regex, which is included in GNU libc and supports > `\S' operand. However, Grep 2.5.4 also use own engines, which can't > interpret `\S' operand. So you mayn't use undocumented op

Re: [PATCH 16/17] grep: remove check_multibyte_string, fix non-UTF8 missed match

2010-03-13 Thread Norihiro Tanaka
Hi, With this patch, `dfaexec' will be called even when the multibyte check fails for a simple pattern that contains no wildcard or repetition expression. Is that intended?

Re: [PATCH 05/17] dfa, grep: cleanup if-before-free and cast-of-argument-to-free

2010-03-13 Thread Norihiro Tanaka
Hi, > I'm not happy with removing the null checks in calls to free(); there > were systems out there that would throw a fatal error if you passed > null to free(). I'd prefer to leave those checks in. Though I also thought so first, in this case I seem it's guaranteed that elements that is small

Re: [PATCH 16/17] grep: remove check_multibyte_string, fix non-UTF8 missed match

2010-03-14 Thread Norihiro Tanaka
Hi, When a line matches with kwset but fails the is_mb_middle test, bug#23814 is caused by not checking the rest of the line (grep never looks for a second match in the line). In this case, the bug is solved by running kwset on the rest of the line. For a simple pattern which doesn't contain th

Re: [PATCH 05/17] dfa, grep: cleanup if-before-free and cast-of-argument-to-free

2010-03-14 Thread Norihiro Tanaka
Hi, > I'm not happy with removing the null checks in calls to free(); there > were systems out there that would throw a fatal error if you passed > null to free(). I'd prefer to leave those checks in. Though I also thought so first, in this case I seem it's guaranteed that elements that is small

Re: [patch #6899] Speed-up for searching in multibyte and ignore-icase.

2010-03-19 Thread Norihiro Tanaka
Hi, Much of patch#6899 might be unnecessary, because the performance issue reported as bug#14472 was improved in the development release. But I think kwsmb.patch still looks very effective.

Re: [PATCH 16/17] grep: remove check_multibyte_string, fix non-UTF8 missed match

2010-03-19 Thread Norihiro Tanaka
Hi, I think that it would be better to be corrected as follows. Please point out if the idea is wrong. diff -ru grep-2.5.4.183-9159-dirty.orig/src/search.c grep-2.5.4.183-9159-dirty/src/search.c --- grep-2.5.4.183-9159-dirty.orig/src/search.c 1970-01-01 00:00:01.0 + +++ grep-2.5

Re: new snapshot available: grep-2.5.4.183-9159

2010-03-19 Thread Norihiro Tanaka
Hi, I tried to build grep-2.5.4.183-9159, but I received a warning on RHEL5. If it has not been corrected yet, I hope you will correct it. diff -ru grep-2.5.4.183-9159-dirty.orig/src/dfa.c grep-2.5.4.183-9159-dirty/src/dfa.c --- grep-2.5.4.183-9159-dirty.orig/src/dfa.c1970-01-01 00:00:0

Re: [patch #7132] Small change to grep-2.6

2010-03-24 Thread Norihiro Tanaka
Hi Paolo, > > diff1: > >I seem that It should match at the head of the line when start_ptr > >isn't set. > > Do you have a testcase? I'm hesitant to apply this without one. No. Though I had a test case for this patch, I have lost it... I need a little time to recreate it. > > dif

Re: [patch #7132] Small change to grep-2.6

2010-03-24 Thread Norihiro Tanaka
Hi, Jim This problem is easily reproduced with both -w option and backref. -- #!/bin/sh # This would fail for grep-2.6 : ${srcdir=.} . "$srcdir/init.sh"; path_prepend_ ../src printf 'foo foo bar\n' > exp1 || framework_failure fail=0 for LOC in en_US.UTF-8 zh_CN $LOCALE_FR_UTF8; do out=ou

Re: [PATCH 2/3] grep: reset state after truncated or invalid multibyte sequences

2010-03-24 Thread Norihiro Tanaka
> > Thank you for the patch. > > Do either of you have a test case? > > No, or I would have included it. But it matches what grep does in > general to handle this case. > > Paolo Thanks. Though I have no test case, I also think an invalid sequence treated as single bytes shouldn't affect the follo

Re: [patch #7134] Patch for is_mb_middle in searchutil.c

2010-03-27 Thread Norihiro Tanaka
Hi, Thank you for your advice. I have requested assignment for changes. However it may take some time... > In future, please consider providing patches in "git format-patch" form, > so it's less work for us. Here are some guidelines that should help: > (they're technically for coreutils, but app

Re: [PATCH] grep: remove unnecessary code

2010-03-27 Thread Norihiro Tanaka
Hi Jim, The regex included up to grep 2.5.4 didn't support RE_ICASE, so we had to convert the pattern and input to lower case beforehand for case-insensitive matching. However, in grep 2.6 regex has been updated. It seems to me that we no longer need to keep that code.

prefix of multibyte on grep-2.6.2

2010-03-30 Thread Norihiro Tanaka
Hi, I have tested grep-2.6.2. However, the fix for multibyte prefixes seems insufficient. Please run the following test case. -- #!/bin/sh # This would mistakenly print a line prior to grep-2.6.2. : ${srcdir=.} . "$srcdir/init.sh"; path_prepend_ ../src encode() { echo "$1" | tr ABC '\357\274\2

Re: Bug#577095: grep: bracket expressions fails depending on the locale

2010-04-09 Thread Norihiro Tanaka
Hi, That seems to be the expected behavior. In the en_US locale, [A-Z] includes A,b,B,c,C,...,y,Y,z,Z (but not `a').
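
A small Python sketch of why a range can pick up lower-case letters. It assumes, purely for illustration, a collation order of aAbBcC...zZ; the real order comes from the locale definition, and `COLLATION` and `range_members` are hypothetical names.

```python
import string

# Assumed collation order for illustration only: aAbBcC...zZ
COLLATION = "".join(
    lo + up for lo, up in zip(string.ascii_lowercase, string.ascii_uppercase)
)


def range_members(first, last):
    """Return every character that sorts between first and last, inclusive."""
    start = COLLATION.index(first)
    end = COLLATION.index(last)
    return COLLATION[start:end + 1]
```

Under that assumed order, `range_members("A", "Z")` starts at `A`, so it contains `b` through `z` and every capital, but not `a`, which sorts just before `A`.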

Re: [PATCH v2] dfa: optimize UTF-8 period

2010-04-19 Thread Norihiro Tanaka
Do you assume that int is 32 bits wide? If CHARCLASS_INTS == 4, the following code may not compile correctly. > + static const charclass utf8_classes[5] = { > + { 0, 0, 0, 0, ~0, ~0, 0, 0 },/* 80-bf: non-lead bytes > */ > + { ~0, ~0, ~0, ~0, 0, 0, 0, 0 },
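
The concern can be illustrated with a Python sketch that builds the same 256-bit set for different word widths. It assumes that a charclass is an array of just enough ints to hold 256 bits, with byte b stored in word b / width at bit b % width; `charclass_words` is a hypothetical helper, not a function from dfa.c.

```python
def charclass_words(lo, hi, int_bits):
    """Build a 256-bit set covering bytes lo..hi as a list of int_bits-wide words."""
    n_words = 256 // int_bits          # 8 words for 32-bit ints, 4 for 64-bit ints
    words = [0] * n_words
    for byte in range(lo, hi + 1):
        words[byte // int_bits] |= 1 << (byte % int_bits)
    return words
```

For the range 80-bf, 32-bit words give eight entries with words 4 and 5 fully set, which corresponds to the quoted initializer. With 64-bit words there are only four entries, so an eight-element initializer would no longer fit.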

Re: GNU grep 2.7 missing library dependency

2010-10-12 Thread Norihiro Tanaka
Try the following. $ CPPFLAGS=-I/usr/local/include \ LDFLAGS=-L/usr/local/lib/hpux32 \ ./configure --without-libiconv-prefix --without-libintl-prefix

bug#16421: Speed-up for case-insensitive matching in multibyte locales

2014-01-11 Thread Norihiro Tanaka
Package: grep Tags: patch Case-insensitive matching is expensive in multi-byte locales because the target text is converted to lower case. However, it seems that awk, which uses dfa.c as grep does, doesn't convert the target text to lower case. It seems that if grep doesn't use kwset, it doesn't also have

bug#16421: Speed-up for case-insensitive matching in multibyte locales

2014-01-12 Thread Norihiro Tanaka
I'm sorry, the content of the attachment was incorrect. I am sending the correct file. grep-ignore-icase.txt Description: Binary data

bug#16544: Optimazation for is_mb_middle

2014-01-24 Thread Norihiro Tanaka
Package: grep Tags: patch When kwsexec or dfaexec finds text matching a regular expression, we need to check whether it is in the middle of a multi-byte character. `is_mb_middle' in searchutils.c is used for that. However, it's expensive, even if most of them contain constitute with sing
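
The check described above can be sketched in Python as follows. This is only an illustration of the idea, not the searchutils.c implementation: it walks the buffer from the start, uses Python codecs in place of mbrtowc, and treats an undecodable byte as a single-byte character (an assumption). `char_starts` is a hypothetical helper.

```python
def char_starts(buf, encoding):
    """Return the set of byte offsets at which a character starts."""
    starts = set()
    i = 0
    while i < len(buf):
        starts.add(i)
        length = 1  # assumption: an undecodable byte counts as one character
        for n in range(1, 5):
            try:
                buf[i:i + n].decode(encoding)
            except UnicodeDecodeError:
                continue
            length = n
            break
        i += length
    return starts


def is_mb_middle(buf, pos, encoding):
    """True if byte offset pos falls inside a multi-byte character."""
    return pos not in char_starts(buf, encoding)
```

The sketch rescans from the beginning of the buffer on every call, which hints at why such a check is costly when most of the input is single-byte text.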

bug#16421: Speed-up for case-insensitive matching in multibyte locales

2014-01-25 Thread Norihiro Tanaka
Hi Jim, Thank you for reviewing the patch. I don't have any requests for changes to the modified comments and commit log. However, can you merge an additional patch, which is attached to this mail, into the commit? No longer `kwsincr_case' is called with case-insensitive matching in a mul

bug#16421: Speed-up for case-insensitive matching in multibyte locales

2014-01-25 Thread Norihiro TANAKA
Sorry, you are right. The declaration of kwset_exact_matches shouldn't be removed.

bug#16544: Optimazation for is_mb_middle

2014-01-28 Thread Norihiro Tanaka
I'm sorry that I didn't test the patch sufficiently. I fixed several bugs in the patch. In addition to the patch, I attach the results of the compile and the performance test. is_mb_middle.txt Description: Binary data make.txt Description: Binary data test.txt Description: Binary data

bug#16544: Optimazation for is_mb_middle

2014-01-29 Thread Norihiro Tanaka
Hi Paul, Thank you for reviewing the patch. > Please use something like this instead All right. > A minor question about naming: in what sense is mbclen_guess a guess? Because mbclen_guess always returns -2 for characters of two or more bytes, I consider that what isn't mbclen_cache should b

bug#16544: Optimazation for is_mb_middle

2014-02-02 Thread Norihiro Tanaka
Hi Jim, Thank you for the review, test and fix for the patch. I have nothing that can be improved after your change. Norihiro

bug#16631: Consideration of title case on case-insensitive matching

2014-02-03 Thread Norihiro Tanaka
Package: grep Tags: patch In the UTF-8 character set, a letter may have not only upper case and lower case but also title case. grep-2.16 fails to match in the following cases because it doesn't take title case into consideration. echo 'LJ' | LC_ALL=en_US.UTF-8 grep -i Lj echo 'Lj' | LC_ALL=en_US.UTF-8 grep -i LJ

bug#16631: Consideration of title case on case-insensitive matching

2014-02-03 Thread Norihiro Tanaka
Sorry, the patch I attached was wrong. Here is the corrected one. case-fold-title-case.txt Description: Binary data

bug#16631: Consideration of title case on case-insensitive matching

2014-02-07 Thread Norihiro Tanaka
Paul Eggert wrote: > 1. It doesn't solve the problem from the ordinary user's point of view. > For example, "echo lj | LC_ALL=en_US.UTF-8 src/grep -i ?" will still > output nothing, because the one-character pattern "?" does not match > the two-character string "lj" even when the latter's two-lette

bug#16544: Optimazation for is_mb_middle

2014-02-10 Thread Norihiro Tanaka
Hi Jim, Sorry for the trouble. When I submit future patches, I will create them with "git format-patch --stdout -1".

bug#16232: [PATCH] grep: make --ignore-case (-i) faster (sometimes 10x) in multibyte locales

2014-02-19 Thread Norihiro Tanaka
Hi, The patch may cause a slowdown, because MBCSET is processed not by the DFA engine but by the regex engine. I tested performance on grep-2.17 and on a version with the patch reverted. The latter is 100x faster. yes $(printf '%078dm' 0)|head -1 > in grep-2.17 original: $ for i in $(seq 10); do

bug#16823: Use DFA regex engine on fgrep matcher

2014-02-20 Thread Norihiro Tanaka
Package: grep Tags: patch In recent years, the grep matcher has become very fast thanks to improvements in the dfa engine. On the other hand, the fgrep matcher only uses the kwset engine, which generally isn't very good at case-insensitive matching. The patch enables to switch case-insensitive matching with fgrep matcher in

bug#16232: [PATCH] grep: make --ignore-case (-i) faster (sometimes 10x) in multibyte locales

2014-02-20 Thread Norihiro TANAKA
Hi Jim, Your patch is probably right. However, I think that the true cause of the 100x slowdown is that the DFA engine is slower than the regex engine for case-insensitive matching in a non-UTF-8 locale. On a multibyte locale, for case-insensitive "a" grep prefers DFA engine, but for character class "[Aa]" pref

bug#16823: Use DFA regex engine on fgrep matcher

2014-02-20 Thread Norihiro Tanaka
In the following case, it is about 200-400x faster, which equals the performance of grep. Patch#16232 may also work effectively. - Before the patch $ yes $(printf '%078dm' 0)| head -100 | tr 0 a > in $ for i in 1 2 3 4 5; do env LC_ALL=ja_JP.UTF-8 time src/fgrep -i 'a' in; done Command exited with non-zer

bug#16842: [PATCH] Use mbrtowc_cache in DFA engine

2014-02-22 Thread Norihiro Tanaka
Package: grep Tags: patch The patch is the DFA version of patch#16544 "Optimazation for is_mb_middle". It will improve performance for non-UTF8 locales in the DFA engine. I tested as below. In both cases, the speed-up is 3-3.5x. $ yes $(printf '%078dm' 0)|head -100 > in $ for i in `seq 5`; do env LC_ALL=ja_JP

bug#16893: [PATCH] Avoid matching line-by-line for case-insensitive with grep

2014-02-28 Thread Norihiro Tanaka
Hi Jim, Thank you for your review and for pointing out the bug in the patch. You are right. I wrote the wrong if conditions. I think that the patch shouldn't change the behavior of the pcre or fgrep matcher. I have fixed the bug, and I am re-sending the patch and the test results. Norihiro av

bug#16893: [PATCH] Avoid matching line-by-line for case-insensitive with grep

2014-02-28 Thread Norihiro Tanaka
I used the attachment on this mail to test for "removal of trivial_case_ignore". Norihiro removal_of_trivial_case_ignore.txt Description: Binary data

bug#16912: [PATCH] no longer use CSET for non-UTF8 locale in DFA engine

2014-03-01 Thread Norihiro Tanaka
Package: grep Tags: patch I overlooked an important thing about the optimization by trivial_case_ignore. After that optimization, the kwset engine can still be used. However, if trivial_case_ignore is removed, kwset is no longer used, because kwsmusts does nothing when MB_CUR_MAX > 1 &&

bug#16919: [PATCH] fix mismatch between dfa and regex for treatment of titlecase

2014-03-01 Thread Norihiro Tanaka
Package: grep Tags: patch I found a difference between how dfa and regex (glibc) treat titlecase. In case-insensitive matching in a UTF-8 locale, U+01C7 (LATIN CAPITAL LETTER LJ) matches U+01C8 (LATIN CAPITAL LETTER L WITH SMALL LETTER J) with regex, but not with dfa. The patch fixes mismat
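
The three case forms of this letter can be inspected with Python's Unicode tables. The sketch below shows one plausible way a matcher could miss the title-case form, namely by comparing only against the upper- and lower-case variants of the pattern character. That this is what dfa did is an assumption, not something the message confirms, and both function names are hypothetical.

```python
import unicodedata

UPPER_LJ, TITLE_LJ, LOWER_LJ = "\u01c7", "\u01c8", "\u01c9"


def variants_upper_lower(ch):
    """Candidate set built only from the upper- and lower-case forms."""
    return {ch, ch.upper(), ch.lower()}


def same_ignoring_case(a, b):
    """Compare two characters after folding both to lower case."""
    return a.lower() == b.lower()
```

Here `variants_upper_lower(UPPER_LJ)` contains U+01C7 and U+01C9 but not the title-case U+01C8, while folding both sides to lower case treats all three forms as equal.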

bug#16912: [PATCH] no longer use CSET for non-UTF8 locale in DFA engine

2014-03-01 Thread Norihiro Tanaka
Hi Paul Thank you for checking the patch. > First, why does the first patch add those four using_utf8 calls to > parse_bracket_exp? Isn't that optimization valid regardless of > whether the multibyte encoding is UTF-8? The optimization which MBCSET is changed into CSET in addtok is completed on

bug#16912: [PATCH] no longer use CSET for non-UTF8 locale in DFA engine

2014-03-02 Thread Norihiro Tanaka
I have made several modifications to the patch. First, I fixed the bug for titlecase. Second, I changed it so that it prefers replacement with OR to CSET in order to reduce the number of states. Third, I modified comments in source code and put drafts of commit messages in the patch. Norihiro patc

bug#16927: [PATCH] grep: avoid to add same character to a bracket expression

2014-03-03 Thread Norihiro Tanaka
Package: grep Tags: patch The patch avoids adding the same character more than once to a bracket expression in trivial_case_ignore. That may generate smaller tokens in multibyte locales. For example, FULLWIDTH LATIN CAPITAL LETTER A (ef bd 81) will be transformed as below, because multibyte characters in CSE

bug#16919: [PATCH] fix mismatch between dfa and regex for treatment of titlecase

2014-03-03 Thread Norihiro Tanaka
Paul Eggert wrote: > On second thought, I may have been too strict here. I suppose one > could interpret POSIX to say that since 'σ' == tolower (toupper ('?')), > that it should be OK for the pattern 'σ' to match the string '?' when > ignoring case, even though the characters differ and are both l

bug#16912: [PATCH] no longer use CSET for non-UTF8 locale in DFA engine

2014-03-04 Thread Norihiro Tanaka
Paul Eggert wrote: > IIRC it's because a CSET matches any byte, while the corresponding > MBCSET only matches that byte if it is a single-byte character. > So for example, say "\x82\x61" is a two-byte character. The CSET "A" > will match it but the corresponding MBCSET will not. > > This can happ
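
The difference between matching bytes and matching characters can be shown with the same two bytes. This Python sketch uses the `shift_jis` codec as a stand-in for the ja_JP.sjis locale; `byte_match` and `char_match` are hypothetical helpers, and they only model the idea of CSET (byte-level) versus MBCSET (character-level), not the dfa.c data structures.

```python
def byte_match(needle, haystack):
    """Byte-level search: the needle may match inside a multi-byte character."""
    return needle in haystack


def char_match(needle, haystack, encoding):
    """Character-level search: decode first, then compare whole characters."""
    return needle in haystack.decode(encoding)
```

For the two-byte character `\x82\x61`, a byte-level search for `a` (0x61) reports a match on the trailing byte, while a character-level search does not.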

bug#16912: [PATCH] no longer use CSET for non-UTF8 locale in DFA engine

2014-03-05 Thread Norihiro Tanaka
Paolo Bonzini wrote: > What about these two commands: > > grep [a] > grep -i A > > Would they match \x82\x61 ("B", U+0FF22) with your patch? And without it? No match for all. -- Before the patch: $ locale -a | grep sjis ja_JP.sjis $ printf "\x82\x61\n" | env LC_ALL=ja_JP.sjis src/gre

bug#16919: [PATCH] fix mismatch between dfa and regex for treatment of titlecase

2014-03-05 Thread Norihiro TANAKA
Hi Paul, Thanks for all the investigation. I have understood that we cannot generally tell whether DFA's or regex's behavior is right. I have tested the behavior of several regex engines. What's interesting is that most of the results differ from one another. And nobody will understand which is right.

bug#16919: [PATCH] fix mismatch between dfa and regex for treatment of titlecase

2014-03-05 Thread Norihiro Tanaka
Norihiro Tanaka wrote: > And nobody will understand which is right. However, I still believe that the upper or lower case of a character should also match its title case, because I think that title case is an extension of cases (such as upper or lower), and furthermore they also match title case (tho

bug#16912: [PATCH] no longer use CSET for non-UTF8 locale in DFA engine

2014-03-05 Thread Norihiro Tanaka
Paolo Bonzini wrote: > Right, it's handled by SKIP_REMAINS_MB_IF_INITIAL_STATE. Yes. It's handled by SKIP_REMAINS_MB_IF_INITIAL_STATE, so no problem. Norihiro

bug#16966: [PATCH] grep: optimization with the superset of DFA

2014-03-07 Thread Norihiro Tanaka
Package: grep Tags: patch With this patch, the DFA may build a superset of itself, which is the same as the original except that ANYCHAR, MBCSET and BACKREF are replaced by a CSET with all bits set followed by STAR, and mb_cur_max is equal to 1. For example, if given the pattern `a\(b\)c\1', the tokens of original
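
The idea of a superset used as a cheap pre-filter can be sketched with Python's `re` module. This is an illustration under simplifying assumptions, not the dfasuperset() code: tokens are plain strings, `(b)` stands for the group `\(b\)`, and `superset`, `to_regex`, `may_match` and `really_matches` are hypothetical names.

```python
import re

FULL_STAR = "[\\x00-\\xff]*"                 # CSET with all bits set, followed by STAR
INEXACT = {"ANYCHAR", "MBCSET", "BACKREF"}   # tokens replaced in the superset


def superset(tokens):
    """Replace every token the superset cannot track exactly with FULL_STAR."""
    return [FULL_STAR if tok in INEXACT else tok for tok in tokens]


def to_regex(tokens):
    """Spell the simplified token list as a Python regular expression."""
    spelled = {"ANYCHAR": ".", "BACKREF": "\\1"}
    return "".join(spelled.get(tok, tok) for tok in tokens)


TOKENS = ["a", "(b)", "c", "BACKREF"]        # stands for a\(b\)c\1
ORIGINAL = re.compile(to_regex(TOKENS))
SUPERSET = re.compile(to_regex(superset(TOKENS)))


def may_match(line):
    """Cheap filter: False means the original pattern cannot match either."""
    return SUPERSET.search(line) is not None


def really_matches(line):
    """Run the exact pattern only on lines that pass the filter."""
    return may_match(line) and ORIGINAL.search(line) is not None
```

A line such as `abd` is rejected by the superset alone. A line such as `abcd` passes the filter but is then rejected by the exact pattern, so the superset never has to produce an exact answer.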

bug#16966: [PATCH] grep: optimization with the superset of DFA

2014-03-07 Thread Norihiro Tanaka
I fixed a bug in dfasuperset() involving QMARK and PLUS and corrected several comments. patch.txt Description: Binary data

bug#16823: Use DFA regex engine on fgrep matcher

2014-03-09 Thread Norihiro Tanaka
I have updated the patch and added a draft of the commit log. Norihiro patch.txt Description: Binary data

bug#16966: [PATCH] grep: optimization with the superset of DFA

2014-03-09 Thread Norihiro Tanaka
Sorry, the patch still had bugs. I fixed them. I confirmed that the patched version passed all regression tests. patch.txt Description: Binary data

bug#17013: [PATCH] grep: optimization by using the Galil rule for Boyer-Moore algorithm in KWSet

2014-03-14 Thread Norihiro Tanaka
Package: grep Tags: patch The Boyer-Moore algorithm runs in O(m n) in the worst case, where it may be much slower than the DFA. The Galil rule reduces O(m n) to O(n) in that case, without overhead or slowdown in other cases, by avoiding to compare more than once for a po
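
The effect of the Galil rule can be seen by counting character comparisons. The Python sketch below is not the kwset.c patch: it uses a Horspool-style bad-character shift on mismatch (a simplification) and, after a full match, shifts by the pattern's smallest period and skips the prefix that is already known to match. `smallest_period` and `search_all` are hypothetical names, and the pattern is assumed to be non-empty.

```python
def smallest_period(pat):
    """Smallest p >= 1 such that pat[p:] equals pat[:len(pat) - p]."""
    m = len(pat)
    for p in range(1, m + 1):
        if pat[p:] == pat[:m - p]:
            return p
    return m


def search_all(text, pat, use_galil):
    """Return (match offsets, number of character comparisons)."""
    m, n = len(pat), len(text)
    # Horspool table: distance from the last occurrence (excluding the final
    # position) of each character to the end of the pattern.
    shift = {c: m - 1 - i for i, c in enumerate(pat[:-1])}
    period = smallest_period(pat)
    hits, comparisons = [], 0
    s, known = 0, 0          # known = length of the prefix already verified
    while s + m <= n:
        j = m - 1
        while j >= known:    # compare right to left, stopping at the known prefix
            comparisons += 1
            if text[s + j] != pat[j]:
                break
            j -= 1
        if j < known:        # full match
            hits.append(s)
            s += period
            known = m - period if use_galil else 0
        else:                # mismatch
            s += shift.get(text[s + m - 1], m)
            known = 0
    return hits, comparisons
```

For a pattern of five `a` characters in a text of twenty, the plain search makes 80 comparisons and the Galil variant makes 20, with the same list of matches in both cases.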

bug#17013: [PATCH] grep: optimization by using the Galil rule for Boyer-Moore algorithm in KWSet

2014-03-14 Thread Norihiro Tanaka
I changed the patch so that the delta2 shift is extracted from the trie, because it's very excellent. Norihiro >From 932e0774428e9b5015c9de31b8a509a5d01c4abe Mon Sep 17 00:00:00 2001 From: Norihiro Tanaka Date: Sat, 15 Mar 2014 14:41:52 +0900 Subject: [PATCH] grep: optimization by u

bug#17019: [PATCH] grep: removal of trivial_case_ignore

2014-03-15 Thread Norihiro Tanaka
m. Norihiro >From 180ad10aa80c22b3ca67ff7201cf578a594f6de9 Mon Sep 17 00:00:00 2001 From: Norihiro Tanaka Date: Sun, 16 Mar 2014 09:33:25 +0900 Subject: [PATCH] grep: removal of trivial_case_ignore When change kwsmusts as it's used even if fill MB_CUR_MAX > 1 and case-insensitive, DFA give

bug#17025: [PATCH] grep: matching line-by-line with regex

2014-03-17 Thread Norihiro Tanaka
by line. However all of buffer is passed to re_search and re_match. I seem that it's wrong. Norihiro >From 7187092186b982b95e94df81393e8fa72060985c Mon Sep 17 00:00:00 2001 From: Norihiro Tanaka Date: Mon, 17 Mar 2014 23:46:31 +0900 Subject: [PATCH] grep: matching line-by-line w

bug#17027: [PATCH] grep: prefer regex to DFA for ANYCHAR in non-UTF8 locales

2014-03-17 Thread Norihiro Tanaka
eal 1.21 user 0.71 sys 0.46 Norihiro >From d69cf4d289034a21067a6e0a7495921df0a2aac9 Mon Sep 17 00:00:00 2001 From: Norihiro Tanaka Date: Mon, 17 Mar 2014 20:41:25 +0900 Subject: [PATCH] grep: prefer regex to DFA for ANYCHAR in multi-byte locales * src/dfa.c (dfaexec): prefer regex to for ANYCH

bug#17034: [PATCH] grep: open CSET and transform into the upper case when MB_CUR_MAX == 1 in dfamust

2014-03-18 Thread Norihiro Tanaka
character fixed string from tokens. Norihiro >From 7a67844524c0657fc395966536805d9736c0a88e Mon Sep 17 00:00:00 2001 From: Norihiro Tanaka Date: Tue, 18 Mar 2014 21:01:47 +0900 Subject: [PATCH] grep: open CSET and transform into the upper case when MB_CUR_MAX == 1 in dfamust In MB_CUR_MAX

bug#17066: a DFA state which is built previously may be re-built in non-UTF8 locales

2014-03-22 Thread Norihiro Tanaka
on Sep 17 00:00:00 2001 From: Norihiro Tanaka Date: Sat, 22 Mar 2014 15:11:52 +0900 Subject: [PATCH] grep: avoid to re-build a state built previously. * src/dfa.c (dfaexec): avoid to re-build a state built previously. --- src/dfa.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git

bug#17070: [PATCH] grep: optimization of DFA by reuse of multi-byte buffers in non-UTF8 locales

2014-03-23 Thread Norihiro Tanaka
>From e56992c4bfdb2e02a114b14c34780672a9c8cee9 Mon Sep 17 00:00:00 2001 From: Norihiro Tanaka Date: Sun, 23 Mar 2014 20:14:33 +0900 Subject: [PATCH] grep: optimization of DFA by reuse of multi-byte buffers in non-UTF8 locales * src/dfa.c (struct dfa): New members `mblen_buf', `nmblen_buf', `inputwcs', `

bug#17082: [PATCH] grep: addition of ]' to special characters

2014-03-24 Thread Norihiro Tanaka
Package: grep Tags: patch `]' should also be treated as a special character in fgrep_to_grep_pattern. Norihiro >From 47e891d0c66259c506db466f830bdf963037999a Mon Sep 17 00:00:00 2001 From: Norihiro Tanaka Date: Mon, 24 Mar 2014 22:58:21 +0900 Subject: [PATCH] grep: addition of ]

bug#17082: [PATCH] grep: addition of ]' to special characters

2014-03-24 Thread Norihiro Tanaka
Hi Paul, Sorry, I was wrong. `]' isn't a special character without `['. On Mon, 24 Mar 2014 08:13:17 -0700 Paul Eggert wrote: > Norihiro Tanaka wrote: > > `]' should also take into special characters in fgrep_to_grep_pattern. > > Sorry, I'm not se

bug#17086: Reg : Bug in Grep command

2014-03-25 Thread Norihiro Tanaka
Hi Senthil, (off the bug tracker, list only) I don't think it's a bug in grep. Is what you want to do `grep -r "Mr\.*" f1.dat*' and `grep -r "Mrs\.*" f1.dat*'? ^ ^ Norihiro On Mon, 24 Mar 2014 19:10:53 +0530 Senthil Kumar wrote: > Dear Person, > > I have
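
The role of the backslash can be checked with a short Python sketch. It assumes that the reporter's pattern was something like `Mr.*` (the original is cut off in this listing) and uses Python's `re` module, whose handling of `.`, `\.` and `*` agrees with grep for these simple patterns. The sample names and the `matching` helper are made up for illustration.

```python
import re

NAMES = ["Mr. Smith", "Mrs. Jones", "Dr. Brown"]


def matching(pattern, lines):
    """Return the lines in which the pattern is found anywhere."""
    return [line for line in lines if re.search(pattern, line)]
```

An unescaped dot matches any character, so `Mr.*` selects both "Mr. Smith" and "Mrs. Jones", whereas `Mr\.` requires a literal dot and selects only "Mr. Smith". Note that in this sketch `Mr\.*` (zero or more dots) still matches "Mrs. Jones".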

bug#17086: Reg : Bug in Grep command

2014-03-25 Thread Norihiro Tanaka
r.*" f1.dat*) is getting retrieved all > names having both > > *Mr. and Mrs.* > please do let me know is this defect ? Hoping to get an revert mail on this. > > -- > > * Thanks & Regards,KK Senthil Kumar* --  田中 紀洋 (Norihiro TANAKA)  E-mail : nori...@kcn.ne.jp

bug#17095: [PATCH] grep: proceed the `beg' pointer after exact matched in KWSet

2014-03-25 Thread Norihiro Tanaka
Package: grep Tags: patch When the multibyte character check fails after an exact match in KWSet, I think that we can advance the `beg' pointer before running the DFA, because no match can then start before the failed position in the text. Norihiro

bug#17095: [PATCH] grep: proceed the `beg' pointer after exact matched in KWSet

2014-03-25 Thread Norihiro Tanaka
Sorry, I failed to attach the patch. I am re-sending it. >From 1bf0ddb362595652fd40008eb4da50f17e1f1358 Mon Sep 17 00:00:00 2001 From: Norihiro Tanaka Date: Wed, 26 Mar 2014 00:41:48 +0900 Subject: grep: proceed the `beg' pointer after exact matched in KWSet * src/dfasearch.c (E

bug#17095: [PATCH] grep: proceed the `beg' pointer after exact matched in KWSet

2014-03-26 Thread Norihiro Tanaka
The patch that I sent previously had a bug. It's necessary to run the DFA in a narrower range without moving the `beg' pointer. The bug is fixed in this patch.

bug#17095: [PATCH] grep: proceed the `beg' pointer after exact matched in KWSet

2014-03-26 Thread Norihiro Tanaka
Sorry for the repeated failure. I re-send it. >From 71158b2fa799379bdcdb6f5ac1b93f78566fbd44 Mon Sep 17 00:00:00 2001 From: Norihiro Tanaka Date: Wed, 26 Mar 2014 00:41:48 +0900 Subject: [PATCH] grep: running DFA in more narrow range after failure in exact match * src/dfasearch.c (EGexec

bug#17098: [PATCH] tests: failure in reversed-range-endpoints test after egrep and fgrep go back to shell scripts

2014-03-26 Thread Norihiro Tanaka
The reversed-range-endpoints test fails after egrep and fgrep went back to being shell scripts. It seems that the program name isn't removed correctly. Norihiro >From f937bbb04826b0fb36aaeb96d95e0ac2a7ac3e33 Mon Sep 17 00:00:00 2001 From: Norihiro Tanaka Date: Wed, 26 Mar 2014 23:06:30 +0900

bug#17095: [PATCH] grep: proceed the `beg' pointer after exact matched in KWSet

2014-03-26 Thread Norihiro Tanaka
Eric Blake wrote: > Your patch is once again illegible. Sorry, I resent it. Norihiro

bug#17095: [PATCH] grep: proceed the `beg' pointer after exact matched in KWSet

2014-03-27 Thread Norihiro Tanaka
Jim, Thanks, I have added the comments to the patch and have slightly modified the comment you wrote. Norihiro From a5540fa9f5e5b9339afe59b3d8e1b3b4791397e4 Mon Sep 17 00:00:00 2001 From: Norihiro Tanaka Date: Thu, 27 Mar 2014 21:34:42 +0900 Subject: [PATCH] grep: perform the kwset-helping DFA

bug#16842: [PATCH] Use mbrtowc_cache in DFA engine

2014-03-28 Thread Norihiro Tanaka
e into new member of struct dfa. When more than one struct dfa is used at the same time, the mbrtowc caches may conflict. So, make mbrtowc_cache a new member of struct dfa, so that each one has its own mbrtowc cache. Norihiro From 41bfd2f66a48efc0cdf1b865c2cc4cdb48d98ce0 Mon Sep 17 00:00:00

bug#17070: [PATCH] grep: optimization of DFA by reuse of multi-byte buffers in non-UTF8 locales

2014-03-28 Thread Norihiro Tanaka
I rebased this patch and added a bug fix to it. If `elems' of `follows' is re-allocated in transit_state(), it may cause a segfault. So, I changed the code so that it doesn't copy d->mb_follows to the `follows' variable. From 92abd82f0d1d42da7c68a3bb3d2d6079073120ae Mon Sep 17 00:00:00 200

bug#16966: [PATCH] grep: optimization with the superset of DFA

2014-03-28 Thread Norihiro Tanaka
iginal dfa. (dfasuperset) 3. Change return type of dfahint(). It can check whether used or not from caller.(dfahint) 4. If both kwset and dfahint() aren't used, run DFA matcher in whole range still. Norihiro From 17f5934d50b121ef3f7c98b0b0db3ae8c891b8d4 Mon Sep 17 00:00:00 2001

bug#17095: [PATCH] grep: proceed the `beg' pointer after exact matched in KWSet

2014-03-28 Thread Norihiro Tanaka
Jim, Thanks, I checked that it acts as expected. Norihiro

bug#17143: [PATCH] grep: speed-up for line matching in fgrep

2014-03-30 Thread Norihiro Tanaka
If line matching fails at a position found by kwsexec(), the line can never match as a whole line, so the line is skipped. Norihiro From b8f24ddeb7ddf211a4dce662734ef4387d48b4c2 Mon Sep 17 00:00:00 2001 From: Norihiro Tanaka Date: Sun, 30 Mar 2014 21:03:58 +0900 Subject: [PATCH] grep: speed

bug#17019: [PATCH] grep: removal of trivial_case_ignore

2014-04-01 Thread Norihiro Tanaka
Hi Paolo, I wrote the patch to speed up the Boyer-Moore algorithm in KWSet in bug#17013. As the next step, I want to be able to use it for case-insensitive matching, too. Furthermore, in patch#17034 I wrote a patch that lets the Boyer-Moore algorithm be used for CSET when the case_fold flag is set. However,

bug#17027: [PATCH] grep: prefer regex to DFA for ANYCHAR in non-UTF8 locales

2014-04-01 Thread Norihiro Tanaka
Hi Paolo, I applied the same type and naming to the member `backref' of dfastate, and I checked that it passes the regression tests. Thanks, Norihiro From 7cbf75fd2e8156f20e34d1d163fe28d6fc1306f1 Mon Sep 17 00:00:00 2001 From: Norihiro Tanaka Date: Tue, 1 Apr 2014 23:48:16 +0900 Subject: [PATCH] grep: p

bug#16966: [PATCH] grep: optimization with the superset of DFA

2014-04-01 Thread Norihiro Tanaka
Hi Paolo, > For ANYCHAR, you can convert it to CSET{1,mb_cur_max} or, even better, > (single-CSET | lead-CSET full-CSET{0,mb_cur_max-1}). That seems complicated to me. The superset requires a memory area that is different from the original DFA and additional costs to build it. And exact matchi

bug#16966: [PATCH] grep: optimization with the superset of DFA

2014-04-01 Thread Norihiro Tanaka
Hi Paolo, > I'm worried that the "STAR" method will match basically everything. If no normal char and/or CSET is included in the pattern, the superset won't be used. > We're using something like CSET{1,mb_cur_max} already for UTF-8, so the size > increase for that should not be too bad. We can

bug#16966: [PATCH] grep: optimization with the superset of DFA

2014-04-01 Thread Norihiro Tanaka
Paolo Bonzini wrote: > Yeah, but my problem is that a.b will look at a very long line if it > is translated to a[\x0-\xff]*b. Better translate it to a[\x0-\xff]{1,2}b > or something similar. I don't think that's a problem. For example, consider the following text for the pattern `a.b'. Whereas the digit

bug#16966: [PATCH] grep: optimization with the superset of DFA

2014-04-01 Thread Norihiro Tanaka
Norihiro Tanaka wrote: > For example, I try following text for the pattern `a.b'. In UTF8, the pattern `a.b' doesn't use the superset. Consider `a[d-z]b' and/or `\(a\)\1b' instead of it. Norihiro

bug#16966: [PATCH] grep: optimization with the superset of DFA

2014-04-01 Thread Norihiro Tanaka
Paolo Bonzini wrote: > Better translate it to a[\x0-\xff]{1,2}b or something similar. I also thought so previously. However, since we don't require an exact match from the superset, I believe that it is meaningless. Norihiro

bug#16966: [PATCH] grep: optimization with the superset of DFA

2014-04-02 Thread Norihiro Tanaka
Paolo Bonzini wrote: > Does anything change if there are a few million c's? The superset of `a ANYCHAR b' is 'a CSET STAR b'. Its DFA states are as follows. s0: The position set is none. s1: The position set is 1:a s2: The position set is 1:a 2:CSET s3: The position set is 1:a 2:CSET 3:b (accep

bug#16966: [PATCH] grep: optimization with the superset of DFA

2014-04-02 Thread Norihiro Tanaka
Norihiro Tanaka wrote: > s0: The position set is none. > s1: The position set is 1:a > s2: The position set is 1:a 2:CSET > s3: The position set is 1:a 2:CSET 3:b (accepted) Sorry, it was wrong. It should be as follows. s0: The position set is none. s1: The position set is

bug#17027: [PATCH] grep: prefer regex to DFA for ANYCHAR in non-UTF8 locales

2014-04-02 Thread Norihiro Tanaka
I changed the type of `has_backref' into `bool'. Norihiro From 11bf4318c360c29a3000afee8ee9f41ec431130e Mon Sep 17 00:00:00 2001 From: Norihiro Tanaka Date: Tue, 1 Apr 2014 23:48:16 +0900 Subject: [PATCH] grep: prefer regex to DFA for ANYCHAR in multi-byte locales * src/dfa.c (dfaexe

bug#17013: [PATCH] grep: optimization by using the Galil rule for Boyer-Moore algorithm in KWSet

2014-04-02 Thread Norihiro Tanaka
In the second patch, I changed the code so that the Boyer-Moore algorithm can also be used for case-insensitive matching if MB_CUR_MAX == 1. It works with patch#17019 and patch#17034. From 25f72238cdda4f3372aaa9181075f975832ef50f Mon Sep 17 00:00:00 2001 From: Norihiro Tanaka Date: Sat, 15 Mar 2014 14:41:52

bug#17034: [PATCH] grep: open CSET and transform into the upper case when MB_CUR_MAX == 1 in dfamust

2014-04-02 Thread Norihiro Tanaka
I fixed the bug in the patch. Added call of resetmust(). From ac54299352bf5feb5cb7a5f24f49c4d019dcc23b Mon Sep 17 00:00:00 2001 From: Norihiro Tanaka Date: Tue, 18 Mar 2014 21:01:47 +0900 Subject: [PATCH] grep: open CSET and transform into the upper case when MB_CUR_MAX == 1 in dfamust * src

bug#17027: [PATCH] grep: prefer regex to DFA for ANYCHAR in non-UTF8 locales

2014-04-03 Thread Norihiro Tanaka
We need to initialize the new member. I have added that to the patch. From 11bf4318c360c29a3000afee8ee9f41ec431130e Mon Sep 17 00:00:00 2001 From: Norihiro Tanaka Date: Tue, 1 Apr 2014 23:48:16 +0900 Subject: [PATCH] grep: prefer regex to DFA for ANYCHAR in multi-byte locales * src/dfa.c (dfaexec): prefer

bug#16966: [PATCH] grep: optimization with the superset of DFA

2014-04-03 Thread Norihiro Tanaka
Norihiro Tanaka wrote: > s0: The position set is none. > s1: The position set is 1:a > s2: The position set is 1:a 2:CSET > s3: The position set is 2:CSET 3:b (accepted) > s4: The position set is 2:CSET Sorry, it was still wrong. It should be as follows. s0: The position set i
