On 10/20/2014 09:04 AM, Norihiro Tanaka wrote:
> This patch improves performance for input string which doesn't match
> even the first part of a pattern. Although there is no less effective
> for grep as it uses a superset of DFA, gawk speeds up about 40%.
>
> When found newline, we can skip check of a multibyte character boundary
> before the character, as we assume newline as a single byte character.
> by that.

POSIX requires that NUL, slash, dot, newline, and carriage return all be
single bytes that cannot occur inside a multibyte character (because they
have special meaning to file name resolution and/or terminal interaction);
it added this requirement fairly recently, but only after confirming that
common existing locales satisfy this constraint. (The same is not true
for most any other character; even though POSIX requires that a-z, A-Z,
and 0-9 be single bytes, it does not forbid those characters from also
being bytes embedded within multibyte characters.)

Is it worth extending your optimization to all five of the
POSIX-guaranteed single-byte characters?

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org
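
For illustration only, here is a minimal sketch of what the suggested extension might look like. It is not code from the patch or from dfa.c; the names is_posix_single_byte and last_safe_boundary are hypothetical, and the sketch assumes the buffer itself starts on a character boundary. The idea is that any of the five bytes named above can serve as a known character boundary, so a backward scan can stop at the first one it finds instead of only at newline.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: true for the five bytes described above as
   guaranteed single-byte characters that never occur inside a
   multibyte character (NUL, slash, dot, newline, carriage return).  */
static bool
is_posix_single_byte (unsigned char c)
{
  return c == '\0' || c == '/' || c == '.' || c == '\n' || c == '\r';
}

/* Hypothetical helper: scan backward from POS and return the offset
   just past the nearest guaranteed single-byte character.  That offset
   is a known character boundary, so no multibyte check is needed
   before it.  Returns 0 if no such byte is found; this assumes the
   buffer starts on a character boundary.  */
static size_t
last_safe_boundary (const char *buf, size_t len, size_t pos)
{
  assert (pos <= len);
  while (pos > 0)
    {
      if (is_posix_single_byte ((unsigned char) buf[pos - 1]))
        return pos;
      pos--;
    }
  return 0;
}
```

A caller would resume multibyte boundary tracking from the returned offset rather than from the start of the buffer. Whether the extra comparisons pay off compared with testing for newline alone would have to be measured.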