Re: [Bug 754272] Re: Range matching incorrect in UTF-8

Glenn Maynard Sat, 18 Jun 2011 18:16:49 -0700

On Sat, Jun 18, 2011 at 7:47 PM, Marcel Stimberg
<754...@bugs.launchpad.net> wrote:


> Well, it is a difficult problem and there is no easy solution -- for
> scripts etc. you just have to use the POSIX locale, only then the behaviour
> is well defined. For your first example, yes, LC_COLLATE=C should be used.
> But your "[á-ú]" example is a good one for showing the difficulty: With the
> current behaviour one has at least an idea about what it will match, but
> using Unicode codepoint ordering this would also match '÷' and 'ø'...
>

That's obvious from looking at a Unicode character table, which you'll
always have to do for Unicode ranges.  However, I doubt anyone who hasn't
already been bitten by this issue would ever expect this:

# echo a | egrep '[A-Z]'
# echo b | egrep '[A-Z]'
b

The problem is that collation is meant for collation; it's unsuitable for
range matching.
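
A minimal Python sketch of why this happens. It assumes a simplified
collation sequence in which each lowercase letter sorts just before its
uppercase counterpart (a A b B ... z Z); that ordering and the helper name
in_range are illustrative assumptions, not glibc's actual locale tables.

```python
import string

# Assumed, simplified collation sequence: a A b B ... z Z.
# Real locale tables are far larger; this only models ASCII letters.
COLLATION = "".join(
    lo + up for lo, up in zip(string.ascii_lowercase, string.ascii_uppercase)
)


def in_range(ch, start, end, order=None):
    """Return True if ch lies between start and end (inclusive).

    With order=None the comparison uses code points; otherwise it uses
    positions in the given collation sequence (all three characters
    must appear in that sequence).
    """
    if order is None:
        return ord(start) <= ord(ch) <= ord(end)
    return order.index(start) <= order.index(ch) <= order.index(end)


# Columns: character, in [A-Z] by code point, in [A-Z] by assumed collation.
for ch in "abzAZ":
    print(ch, in_range(ch, "A", "Z"), in_range(ch, "A", "Z", COLLATION))
```

Under the assumed sequence, 'a' sorts before 'A' and falls outside the
range, while 'b' through 'z' sit between 'A' and 'Z' and fall inside it,
which is the same pattern as the egrep output above.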

> IMHO, those examples are not that realistic anyway, scripts often set
> LC_ALL=C for parsing the output of other programs and in situations where
> Unicode is really needed, things like [[:upper:]] mostly suffice.
>

I shouldn't have to choose between sane ranges and Unicode support.  (By the
way, it's usually LC_COLLATE=C that's wanted rather than LC_ALL=C.  For
example, 'A.B' won't match 'A本B' if LC_ALL=C.)
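
The 'A.B' case can be illustrated outside grep. The sketch below uses
Python's re module as a stand-in: a str pattern models character-aware
matching, and a bytes pattern models what a byte-oriented matcher sees
under LC_ALL=C. It is an analogy under those assumptions, not a
reproduction of grep's internals.

```python
import re

text = "A本B"

# Character-aware matching: '.' consumes the single character 本.
char_match = re.search(r"A.B", text)
print(bool(char_match))

# Byte-oriented view: 本 is three bytes in UTF-8, so one '.' cannot
# span the gap between 'A' and 'B'.
data = text.encode("utf-8")
print(len(data))
byte_match = re.search(rb"A.B", data)
print(bool(byte_match))

# Three single-byte wildcards are needed to cover the same character.
three_byte_match = re.search(rb"A...B", data)
print(bool(three_byte_match))
```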

Other examples where you want ranges by codepoint order, just off the top of
my head: '[ぁ-ヾｦ-ﾟ]' (incomplete) to match Japanese hiragana and katakana;
similar expressions for matching line-drawing and symbol ranges; matching
things like SASLprep tables (http://tools.ietf.org/html/rfc3454); finding
strings that will require UTF-16 surrogate pairs (a range covering U+10000
through U+10FFFF), and finding
strings which are unrepresentable in UTF-16 and/or UCS-2.
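
For comparison, a short sketch of what codepoint-ordered ranges look like
in practice. It uses Python's re module, which compares ranges in str
patterns by code point; the variable names and sample strings are made up
for illustration, and the kana range is the same incomplete one given
above.

```python
import re

# Hiragana/katakana plus half-width katakana, by code point.
kana = re.compile("[ぁ-ヾｦ-ﾟ]")
print(bool(kana.search("ひらがな")))
print(bool(kana.search("ｶﾀｶﾅ")))
print(bool(kana.search("kana")))

# Characters above U+FFFF need a surrogate pair in UTF-16.
astral = re.compile("[\U00010000-\U0010FFFF]")
print(bool(astral.search("plain text")))
print(bool(astral.search("clef: \U0001D11E")))

# The quoted [á-ú] range (U+00E1 to U+00FA), written with escapes to
# avoid normalization surprises. U+00F7 (÷) and U+00F8 (ø) lie inside it.
accented = re.compile("[\u00e1-\u00fa]")
candidates = "\u00e1\u00e9\u00f7\u00f8\u00faa"
matched = [c for c in candidates if accented.fullmatch(c)]
print(matched)
```

Every candidate except the plain 'a' matches the last range, including the
division sign and 'ø', as the quoted message points out.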

> But please also see bug 759849 and the comments, there seem to be some
> upstream changes regarding the issue.
>

regex(7) seems to have similar breakage--presumably they mimic each other,
both being GNU tools--so I don't think it helps.  It probably moves the
issue into glibc, though.

-- 
Glenn Maynard

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/754272

Title:
  Range matching incorrect in UTF-8

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/grep/+bug/754272/+subscriptions

-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
