On May 1, 2004, at 12:00 PM, Aaron Sherman wrote:

On Sat, 2004-05-01 at 14:18, Jeff Clites wrote:

Exactly. And also realize that if you bitwise-not (or shift or
something similar) the bytes of a UTF-8 serialization of something, the
result isn't going to be valid UTF-8, so you'd be hard-pressed to lay
text semantics down on top of it.
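(A minimal sketch of that quoted point in Python, since the thread has no code of its own; "naïve" here is just an arbitrary example string:)

    # Flip every byte of a UTF-8 serialization and try to read it back as UTF-8.
    # For this input (and most others) the result is not well-formed UTF-8.
    data = "naïve".encode("utf-8")               # b'na\xc3\xafve'
    flipped = bytes((~b) & 0xFF for b in data)   # bitwise-not of each byte
    try:
        flipped.decode("utf-8")
    except UnicodeDecodeError as e:
        print("not well-formed UTF-8:", e)       # 0x91 (flipped 'n') is a bare continuation byte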

How are you defining "valid UTF-8"? Is there a codepoint in UTF-8 between \x00 and \xff that isn't valid? Is there a reason to ever do bitwise operations on anything other than 8-bit codepoints?

If you're dealing in terms of code points, then the UTF-8 encoding (or any other) has nothing to do with it.


If you are dealing in terms of bytes, then there are byte sequences which don't encode any code point in the UTF-8 encoding. By "valid UTF-8", I'm referring to the definition of that encoding (and I should have said "well-formed")--see section 3.9, item D36 of the Unicode Standard. In particular, bytes 0xC0, 0xC1, and 0xF5-0xFF cannot occur in UTF-8.
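(As a Python sketch of that clause--not Parrot code--0xC0 and 0xC1 could only start over-long encodings, and 0xF5-0xFF would start sequences past U+10FFFF, so a strict decoder rejects them all:)

    for lead in (0xC0, 0xC1, 0xF5, 0xFF):
        try:
            bytes([lead, 0x80]).decode("utf-8")
        except UnicodeDecodeError:
            print("0x%02X can never appear in well-formed UTF-8" % lead)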

But if you're speaking in terms of code points, that's not relevant; then again, neither is the encoding.
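(Again just a sketch: whichever serialization you start from, decoding recovers the same code-point sequence, so an operation defined on code points can't depend on the encoding.)

    s = "naïve"
    utf8  = s.encode("utf-8")       # 6 bytes
    utf16 = s.encode("utf-16-be")   # 10 bytes
    # Either serialization decodes back to the identical code points.
    assert [ord(c) for c in utf8.decode("utf-8")] == \
           [ord(c) for c in utf16.decode("utf-16-be")]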

More to the point, I said all of this at the beginning of this thread.
You should not, at this point, be confused about the scope of what I
want to do, as it was very narrowly and clearly defined up-front.

And yet, I am confused. You said near the beginning of the thread:

On Fri, 2004-04-30 at 10:42, Dan Sugalski wrote:

Bitstring operations ought only be valid on binary data, though,
unless someone can give me a good reason why we ought to allow
bitshifting on Unicode. (And then give me a reasoned argument *how*,
too)

100% agree. If you want to play games with any other encoding, you may proceed to write your own damn code ;-)

Given that, I'm not sure how UTF-8 is coming into the picture.

Jeff


