bug#18777: [PATCH] dfa: improvement for checking of multibyte character boundary

2016-05-01 Thread Paul Eggert
I have installed this and am closing the bug report.

bug#18777: [PATCH] dfa: improvement for checking of multibyte

2016-04-20 Thread Jim Meyering
On Wed, Apr 20, 2016 at 11:21 PM, Paul Eggert wrote:
> I'm attaching a revised patch, relative to the latest grep, to implement the
> idea of the Bug#18777 patch. This revision calls the new array "never_trail"
> instead of "always_character_boundary" to nail down the concept a bit more
> precisel

bug#18777: [PATCH] dfa: improvement for checking of multibyte

2016-04-20 Thread Paul Eggert
I'm attaching a revised patch, relative to the latest grep, to implement the idea of the Bug#18777 patch. This revision calls the new array "never_trail" instead of "always_character_boundary" to nail down the concept a bit more precisely. It also removes what appears to be an unnecessary p < mb

bug#18777: [PATCH] dfa: improvement for checking of multibyte character boundary

2015-01-16 Thread Norihiro Tanaka
On Fri, 19 Dec 2014 00:54:58 +0900 Norihiro Tanaka wrote:
> On Thu, 18 Dec 2014 01:40:18 -0800
> Thanks, I understood what you said. You are right. I changed the patch
> so that always_character_boundary is not pruned even if WCP != NULL, and
> fixed the API document.
I fixed a mismatch with t

bug#18777: [PATCH] dfa: improvement for checking of multibyte character boundary

2014-12-18 Thread Norihiro Tanaka
On Thu, 18 Dec 2014 01:40:18 -0800 Paul Eggert wrote:
> Why? The (only) caller with WCP != NULL doesn't use *WCP when
> skip_remains_mb (D, P, ..., WCP) returns P. So it's OK to not set *WCP
> in that case.
Thanks, I understood what you said. You are right. I changed the patch so that always_

bug#18777: [PATCH] dfa: improvement for checking of multibyte character boundary

2014-12-18 Thread Paul Eggert
Norihiro Tanaka wrote:
> if WCP != NULL, we must set a wide character for 0x95 0x5c to WCP before return P.
Why? The (only) caller with WCP != NULL doesn't use *WCP when skip_remains_mb (D, P, ..., WCP) returns P. So it's OK to not set *WCP in that case.

bug#18777: [PATCH] dfa: improvement for checking of multibyte character boundary

2014-12-17 Thread Norihiro Tanaka
On Wed, 17 Dec 2014 09:46:09 -0800 Paul Eggert wrote:
> Yes, and that's the point: we don't want this if-statement to be pruned
> if WCP != NULL. We want the code to return P right away in the typical
> case where P is at a character boundary. If MBP is way less than P,
> this will save the wor

bug#18777: [PATCH] dfa: improvement for checking of multibyte character boundary

2014-12-17 Thread Paul Eggert
On 12/17/2014 09:21 AM, Norihiro Tanaka wrote:
> If WCP != NULL, all of the following code will be pruned, although I think
> that the effect on performance is negligible.
>   if (wcp == NULL && always_character_boundary[*p])
>     return p;
Yes, and that's the point: we don't want this if-statement to be p
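The early return discussed in this message can be illustrated with a small, self-contained sketch. It is not the code from the patch: it assumes UTF-8 input instead of the locale-driven decoding that dfa.c uses, and the names `never_trail_utf8`, `utf8_lead_len`, `char_len` and `skip_to_boundary_sketch` are hypothetical. The point it shows is only that when the byte at P can never be a trailing byte, P can be returned at once without rescanning from the last known boundary MBP.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumption: UTF-8 input.  Only bytes 0x80..0xbf can be trailing bytes. */
static bool
never_trail_utf8 (unsigned char b)
{
  return (b & 0xc0) != 0x80;
}

/* Sequence length implied by a lead byte; 1 for ASCII and invalid bytes. */
static size_t
utf8_lead_len (unsigned char lead)
{
  if (0xc2 <= lead && lead <= 0xdf)
    return 2;
  if (0xe0 <= lead && lead <= 0xef)
    return 3;
  if (0xf0 <= lead && lead <= 0xf4)
    return 4;
  return 1;
}

/* Bytes consumed by the character starting at S (S < END).  Stops early
   at any byte that is not a trailing byte, so such bytes always begin a
   new character, which keeps the slow path consistent with the fast path. */
static size_t
char_len (unsigned char const *s, unsigned char const *end)
{
  size_t want = utf8_lead_len (*s);
  size_t n = 1;
  while (n < want && s + n < end && !never_trail_utf8 (s[n]))
    n++;
  return n;
}

/* Return the first character boundary at or after P, given that MBP <= P
   is a known boundary and END is the end of the buffer. */
static unsigned char const *
skip_to_boundary_sketch (unsigned char const *mbp, unsigned char const *p,
                         unsigned char const *end)
{
  assert (mbp <= p && p <= end);

  /* Fast path: P cannot be inside a character, so no rescan is needed. */
  if (p < end && never_trail_utf8 (*p))
    return p;

  /* Slow path: walk character by character from the known boundary. */
  while (mbp < p)
    mbp += char_len (mbp, end);
  return mbp;
}
```

If MBP is far behind P, the fast path avoids decoding every character in between, which is the saving described in the reply.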

bug#18777: [PATCH] dfa: improvement for checking of multibyte character boundary

2014-12-17 Thread Norihiro Tanaka
On Tue, 16 Dec 2014 16:06:54 -0800 Paul Eggert wrote:
> did you mean "robust in the presence of future changes"?
Yes. However, I might have made too big a deal of the "portable" point.
> True, but I wasn't worried so much about that. I was worried about the
> case where WCP != NULL: the

bug#18777: [PATCH] dfa: improvement for checking of multibyte character boundary

2014-12-16 Thread Paul Eggert
Norihiro Tanaka wrote:
> However, first, it is no longer portable after removing it.
"portable"? This issue is independent of platform, surely. By "portable" did you mean "robust in the presence of future changes"?
> Second, if it is compiled with GCC 4.3 or later, the function is inlined by and "

bug#18777: [PATCH] dfa: improvement for checking of multibyte character boundary

2014-12-16 Thread Norihiro Tanaka
On Tue, 16 Dec 2014 09:12:21 -0800 Paul Eggert wrote:
> This part of the patch does too much work, as the caller inspects *WCP
> only when skip_remains_mb returns a value not equal to p. So there's
> no need for the "wcp == NULL &&" test in the patch. Instead, the
> documented API can change,

bug#18777: [PATCH] dfa: improvement for checking of multibyte character boundary

2014-12-16 Thread Paul Eggert
On 12/16/2014 04:42 AM, Norihiro Tanaka wrote:
> Thanks for the review and suggestion. If using_utf8 () is true, we can set
> always_character_boundary to true except 0x80-0xbf.
Even better, thanks.
> This won't assign anything to *WCP, contrary to the documented API
> for skip_remains_mb.
T
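The UTF-8 rule proposed here ("true except 0x80-0xbf") can be sketched as a table initializer. This is only an illustration of the idea: the table is named `never_trail` after the later revision of the patch, while `NBYTES` and `init_never_trail_utf8` are hypothetical names and not taken from dfa.c.

```c
#include <limits.h>
#include <stdbool.h>

enum { NBYTES = UCHAR_MAX + 1 };

/* never_trail[b] is true if byte B can never be a trailing (non-first)
   byte of a multibyte character, so a pointer at B is always at a
   character boundary. */
static bool never_trail[NBYTES];

/* Assumption: the locale is UTF-8.  In UTF-8 only the continuation bytes
   0x80..0xbf (bit pattern 10xxxxxx) appear after the first byte of a
   character.  Every other byte is ASCII, a lead byte, or a byte that is
   never valid (0xc0, 0xc1, 0xf5..0xff), and none of those can trail. */
static void
init_never_trail_utf8 (void)
{
  for (int b = 0; b < NBYTES; b++)
    never_trail[b] = (b & 0xc0) != 0x80;
}
```

A single mask test covers both the ASCII bytes and the invalid bytes mentioned earlier in the thread, so no per-byte list is needed in the UTF-8 case.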

bug#18777: [PATCH] dfa: improvement for checking of multibyte character boundary

2014-12-16 Thread Norihiro Tanaka
On Mon, 15 Dec 2014 09:43:54 -0800 Paul Eggert wrote:
> Can't we improve this when using_utf8 () is true? In that case, every
> ASCII character is always single byte. Also, the bytes 0xc0, 0xc1,
> and 0xf5 through 0xff can be added to the table: they are not
> single-byte characters but they ar

bug#18777: [PATCH] dfa: improvement for checking of multibyte character boundary

2014-12-15 Thread Paul Eggert
On 12/15/2014 06:59 AM, Norihiro Tanaka wrote:
> +/* True if each byte can not occur inside a multibyte character */
> +static bool always_single_byte[NOTCHAR];
> +
> +static void
> +dfaalwayssb (void)
> +{
> +  size_t i;
> +  unsigned char const uc[] = { '\0', '\n', '\r', '.', '/' };
> +  for (i = 0; i < sizeof

bug#18777: [PATCH] dfa: improvement for checking of multibyte character boundary

2014-12-15 Thread Norihiro Tanaka
On Mon, 20 Oct 2014 10:07:20 -0600 Eric Blake wrote:
> POSIX requires that NUL, slash, dot, newline, and carriage return all be
> single bytes that cannot occur inside a multibyte character (because
> they have special meaning to file name resolution and/or terminal
> interaction); it added this
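For locales other than UTF-8, the guarantee quoted above suggests a table that marks only those five bytes. The following is a minimal sketch under that assumption; `always_single_byte_sketch` and `init_always_single_byte_sketch` are hypothetical names, and the code is not a reconstruction of the truncated patch quoted elsewhere in this thread.

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* True for bytes that, per the POSIX requirement cited in the thread,
   are always single-byte characters and never occur inside a multibyte
   character, whatever the locale's encoding. */
static bool always_single_byte_sketch[UCHAR_MAX + 1];

static void
init_always_single_byte_sketch (void)
{
  /* NUL, newline, carriage return, dot, slash. */
  static unsigned char const posix_single[] = { '\0', '\n', '\r', '.', '/' };

  for (size_t i = 0; i < sizeof posix_single; i++)
    always_single_byte_sketch[posix_single[i]] = true;
}
```

Other ASCII bytes are left false on purpose: in encodings such as Shift_JIS a byte like 0x5c can be the second byte of a character, which is the 0x95 0x5c case raised later in the discussion.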

bug#18777: [PATCH] dfa: improvement for checking of multibyte character boundary

2014-10-22 Thread Norihiro Tanaka
arn...@skeeve.com wrote:
> Gawk does not remove CR in advance, unless someone specifically
> set RS = "\r\n", in which case the full regex matcher is used
> to first find \r\n in the raw input buffer.
Thanks, I also confirmed this in the Gawk source code.
> So for gawk, adding a check for (c == eolb

bug#18777: [PATCH] dfa: improvement for checking of multibyte character boundary

2014-10-21 Thread arnold
Hi.
Norihiro Tanaka wrote:
> arn...@skeeve.com wrote:
> > I would think adding a check for '\r' would be safe and would help
> > too; given that on Windows systems '\r' generally occurs just as
> > frequently as '\n', it should give a nice speedup for gawk on those
> > systems.
>
> As I recogniz

bug#18777: [PATCH] dfa: improvement for checking of multibyte character boundary

2014-10-21 Thread Norihiro Tanaka
arn...@skeeve.com wrote:
> I would think adding a check for '\r' would be safe and would help
> too; given that on Windows systems '\r' generally occurs just as
> frequently as '\n', it should give a nice speedup for gawk on those
> systems.
As I recognize that DFA and regex aren't support multipl

bug#18777: [PATCH] dfa: improvement for checking of multibyte character boundary

2014-10-20 Thread arnold
Norihiro Tanaka wrote:
> Eric Blake wrote:
> > Is it worth extending your optimization to all five of the
> > POSIX-guaranteed single byte characters?
>
> Thanks, but I don't want to perform it immediately. DFA has already
> regarded newline as a single byte character, but hasn't others yet. S

bug#18777: [PATCH] dfa: improvement for checking of multibyte character boundary

2014-10-20 Thread Norihiro Tanaka
Eric Blake wrote:
> Is it worth extending your optimization to all five of the
> POSIX-guaranteed single byte characters?
Thanks, but I don't want to do that immediately. DFA already treats newline as a single-byte character, but not the others yet. So, we may need to make many changes

bug#18777: [PATCH] dfa: improvement for checking of multibyte character boundary

2014-10-20 Thread Eric Blake
On 10/20/2014 09:04 AM, Norihiro Tanaka wrote:
> This patch improves performance for input string which doesn't match
> even the first part of a pattern. Although there is no less effective
> for grep as it uses a superset of DFA, gawk speeds up about 40%.
>
> When found newline, we can skip

bug#18777: [PATCH] dfa: improvement for checking of multibyte character boundary

2014-10-20 Thread Norihiro Tanaka
Norihiro Tanaka wrote:
> $ time -p env LC_ALL=ja_JP.eucJP ./gawk '/k/ { print }' ../k
The file `k' is below.
$ yes `printf '%040d' 0` | head -1000 >../k

bug#18777: [PATCH] dfa: improvement for checking of multibyte character boundary

2014-10-20 Thread Norihiro Tanaka
This patch improves performance for input strings that do not match even the first part of a pattern. Although it has little effect on grep, which uses a superset of DFA, gawk speeds up by about 40%.
$ time -p env LC_ALL=ja_JP.eucJP ./gawk '/k/ { print }' ../k
(before)
real 2.85
user 2.79