Re: [pugs] regexp "bug"?

hv Fri, 15 Apr 2005 05:22:31 -0700

"Mark A. Biggar" <[EMAIL PROTECTED]> wrote:
:BÁRTHÁZI András wrote:
:
:> Hi,
:> 
:> This code:
:> 
:> my $a='A';
:> $a ~~ s:perl5:g/A/{chr(65535)}/;
:> say $a.bytes;
:> 
:> Outputs "0". Why?
:> 
:> Bye,
:>   Andras
:> 
:
:\uFFFF is not a legal Unicode codepoint.  chr(65535) should raise an
:exception of some type.  So the above code does seem to show a possible
:bug. But as that chr(65535) is an undefined char, who knows what the
:code is actually doing.


In perl5 at least, we support a wider concept of codepoints than the
Unicode consortium. This allows us to use strings for a wider variety
of things than just Unicode text (e.g. version strings, bit vectors, etc.).

In perl6 the greatly expanded set of types will presumably allow us
to distinguish actual Unicode data from more arbitrary sequences of
codepoints, and I'd normally expect that the more constrained type
would be a subtype of the less constrained type. In this case that
means I'd expect "Unicode string" to be a subtype of something like
"codepoint sequence".
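
A rough sketch of that subtype relationship, written in Python only
because the perl6 type syntax is not something I'd want to guess at.
All names here (CodepointSequence, UnicodeString,
is_unicode_text_codepoint) are made up for illustration, and the
validity rule is one possible reading of "legal Unicode codepoint"
(in range, not a surrogate, not a noncharacter such as U+FFFF):

```python
def is_unicode_text_codepoint(cp):
    """Assumed rule: in range, not a surrogate, not a noncharacter."""
    if cp < 0 or cp > 0x10FFFF:
        return False
    if 0xD800 <= cp <= 0xDFFF:          # surrogates
        return False
    if 0xFDD0 <= cp <= 0xFDEF:          # noncharacter block
        return False
    if (cp & 0xFFFE) == 0xFFFE:         # U+xxFFFE / U+xxFFFF in every plane
        return False
    return True


class CodepointSequence:
    """Less constrained type: any sequence of non-negative integers."""

    def __init__(self, codepoints):
        self.codepoints = tuple(codepoints)
        for cp in self.codepoints:
            if not isinstance(cp, int) or cp < 0:
                raise ValueError(f"not a codepoint: {cp!r}")

    def __len__(self):
        return len(self.codepoints)


class UnicodeString(CodepointSequence):
    """More constrained subtype: every element must pass the Unicode check."""

    def __init__(self, codepoints):
        super().__init__(codepoints)
        for cp in self.codepoints:
            if not is_unicode_text_codepoint(cp):
                raise ValueError(f"not usable in Unicode text: U+{cp:04X}")


loose = CodepointSequence([65, 0xFFFF])   # accepted by the wider type
strict = UnicodeString([65, 66])          # accepted by the subtype
# UnicodeString([0xFFFF]) would raise ValueError under the assumed rule
```

Under this arrangement, anything that accepts a CodepointSequence also
accepts a UnicodeString, but not the other way round.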

(In fact it'd probably be useful to have more levels than that - there
are times when you need the Unicode concepts for things like [[:digit:]],
but may be able to get better performance by avoiding the checks for
'legal Unicode codepoint'.)

On the other hand you will probably be able to achieve the things p5
overloads onto strings using packed integer arrays, so maybe this all
represents unnecessary complications. In which case maybe 'relaxed'
variants of Unicode strings aren't needed. We will probably still want
other sorts of strings though, such as ASCII.
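
To make the packed-integer-array idea concrete, here is a small sketch
for the version-string case, again in Python and again only an
illustration: version_array is a hypothetical helper, and the element
type "I" (unsigned int) is an arbitrary choice.

```python
from array import array


def version_array(text):
    """Pack a dotted version such as '5.8.6' into an array of unsigned ints."""
    return array("I", (int(part) for part in text.split(".")))


older = version_array("5.8.6")
newer = version_array("5.10.0")

# Arrays compare element by element, so 8 < 10 decides the result.
print(older < newer)          # True
# The same comparison on the plain strings goes the wrong way.
print("5.8.6" < "5.10.0")     # False
```

The ordering here comes from the integers themselves, so nothing has
to be smuggled into a string as an out-of-range codepoint.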

Hugo
