On Fri, Apr 15, 2005 at 05:12:54PM +0000, [EMAIL PROTECTED] wrote:
: Isn't that what the difference between byte-level and codepoint-level
: access to strings is all about. If you want to work with values that
: are illegal codepoints then you should be working at the byte-level
: not the codepoint-level, at least by default.

Sure, but there's no guarantee you have access to a lower level,
depending on the interface presented by the object in question, and
you probably shouldn't have to know that anyway, if there's a useful
abstraction level at which "illegal character" means something as a
unit to the higher level. The fact is that U+FFFF is an illegal
character regardless of the encoding, and I'd like to be able to talk
about it as a character, without having to know whether it's an
illegal UTF-8 byte sequence, or an illegal UTF-16 byte sequence, or a
256-bit integer stored somewhere that you just aren't allowed to think
about certain values of.

In short, "legal" Unicode strings should probably be viewed as a
constrained subtype of strings, not as a storage type. I know you've
known Ada from its infancy. :-) Perl 6 makes the same distinction,
and can presumably get at the unconstrained type for any constrained
type. So if you hand me a Unicode string with arbitrary value
restrictions, there had better be a way to view that string without
the arbitrary restrictions. You need to be able to determine somehow
that types Even or Odd have a storage class of type Int.

Larry