On Sat, Jun 18, 2011 at 7:47 PM, Marcel Stimberg <754...@bugs.launchpad.net> wrote:
> Well, it is a difficult problem and there is no easy solution -- for
> scripts etc. you just have to use the POSIX locale, only then the behaviour
> is well defined. For your first example, yes, LC_COLLATE=C should be used.
> But your "[á-ú]" example is a good one for showing the difficulty: With the
> current behaviour one has at least an idea about what it will match, but
> using Unicode codepoint ordering this would also match '÷' and 'ø'...

That's obvious from looking at a Unicode character table, which you'll
always have to do for Unicode ranges. However, I doubt anyone who hasn't
already been bitten by this issue would ever expect this:

# echo a | egrep '[A-Z]'
# echo b | egrep '[A-Z]'
b

The problem is that collation is meant for collation; it's unsuitable for
range matching.

> IMHO, those examples are not that realistic anyway, scripts often set
> LC_ALL=C for parsing the output of other programs and in situations where
> Unicode is really needed, things like [[:upper:]] mostly suffice.

I shouldn't have to choose between sane ranges and Unicode support. (By
the way, it's usually LC_COLLATE=C that's wanted rather than LC_ALL=C. For
example, 'A.B' won't match 'A本B' if LC_ALL=C.)

Other examples where you want ranges by codepoint order, just off the top
of my head: '[ぁ-ヾヲ-゚]' (incomplete) to match Japanese hiragana and
katakana; similar expressions for matching line-drawing and symbol ranges;
matching things like SASLprep tables (http://tools.ietf.org/html/rfc3454);
finding strings that will require UTF-16 surrogate pairs ('[�-]'), and
finding strings which are unrepresentable in UTF-16 and/or UCS-2.

> But please also see bug 759849 and the comments, there seem to be some
> upstream changes regarding the issue.

regex(7) seems to have similar breakage--presumably they mimic each other,
both being GNU tools--so I don't think it helps. It probably moves the
issue into glibc, though.
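
For illustration only, here is a minimal sketch of what codepoint-order
range matching looks like, using Python's re module as a stand-in. Python
compares range endpoints by Unicode codepoint regardless of LC_COLLATE; it
is not grep, and it says nothing about what grep or glibc do internally.
The helper name in_range and the constants ACCENTED and ASTRAL are made up
for this example, and the bytes pattern at the end is only an analogy for
how a byte-oriented C locale sees 'A本B'.

```python
import re


def in_range(pattern, ch):
    """Return True if the single character ch matches the bracket expression."""
    return re.fullmatch(pattern, ch) is not None


# [á-ú] written with escapes: U+00E1 through U+00FA.
ACCENTED = '[\u00e1-\u00fa]'

# Characters outside the BMP, i.e. those that need UTF-16 surrogate pairs.
ASTRAL = re.compile('[\U00010000-\U0010FFFF]')

# By codepoint, lowercase ASCII letters never fall inside [A-Z].
for ch in 'abB':
    print(ch, in_range('[A-Z]', ch))

# By codepoint, the range also takes in U+00F7 (÷) and U+00F8 (ø).
for ch in '\u00e1\u00fa\u00f7\u00f8a':
    print('U+%04X' % ord(ch), in_range(ACCENTED, ch))

# On decoded text '.' consumes one character; on raw UTF-8 bytes it
# consumes one byte, so the three-byte U+672C (本) no longer fits.
text = 'A\u672cB'
print(re.fullmatch('A.B', text) is not None)
print(re.fullmatch(b'A.B', text.encode('utf-8')) is not None)
```

The point of the sketch is that each answer can be read straight off a
Unicode table, with no locale involved.
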
--
Glenn Maynard

--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/754272

Title:
  Range matching incorrect in UTF-8