On May 1, 2004, at 8:26 AM, Jarkko Hietaniemi wrote:
So it seems to me that the "obvious" way to go is to have all bit-s operations first convert to raw bytes (possibly throwing an exception) and then proceed to do their work.
If these conversions croak if there are code points beyond \x{ff}, I'm
fine with it. But trying to mix \x{100} or higher just leads into silly
discontinuities (basically we would need to decide on a word width, and
I think that would be a silly move).
Just FYI, the way I implemented bitwise-not so far was to bitwise-not code points 0x{00}-0x{FF} as uint8-sized things, 0x{100}-0x{FFFF} as uint16-sized things, and > 0x{FFFF} as uint32-sized things (but then bit-masking them with 0xFFFFF to make sure that they fell into a valid code point range). That's pretty arbitrary, but if you bitwise-not as though everything were 32 bits wide, you'll end up with a "string" containing no assigned code points at all (they'll all be > 0x10FFFF). But from a text point of view, bitwise-not on a string isn't a sensible operation no matter how you slice it (that is, even for 0x{00}-0x{FF}), so one flavor of arbitrary is just about as good as any other. We could also make anything > 0x{FF} map to either 0x{00} or 0x{FF}, or mask it with 0xFF to push it into that range. It's all pretty meaningless, as text transformations go, and I can't imagine anyone using it for anything, except maybe weak encryption.
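To make the width classes concrete, here is a rough Python sketch of the scheme described above. It is not the actual implementation; the function names are made up for illustration, and it works on plain integers rather than strings, since some results land on surrogates or unassigned code points.

```python
# Hypothetical sketch of the width-dependent bitwise-not described above.
# Names are illustrative only; this is not the real implementation.

MAX_CODE_POINT = 0x10FFFF  # top of the Unicode code point range


def bnot_code_point(cp: int) -> int:
    """Invert one code point, with the width chosen by its magnitude."""
    if cp <= 0xFF:
        return ~cp & 0xFF          # treat as a uint8
    if cp <= 0xFFFF:
        return ~cp & 0xFFFF        # treat as a uint16
    # Treat as a uint32, then mask back into the code point range.
    return (~cp & 0xFFFFFFFF) & 0xFFFFF


def bnot_32(cp: int) -> int:
    """Invert one code point as though everything were 32 bits wide."""
    return ~cp & 0xFFFFFFFF


def bnot_code_points(cps):
    """Apply the width-dependent inversion to a sequence of code points."""
    return [bnot_code_point(cp) for cp in cps]
```

With this sketch, bnot_code_point(0x41) gives 0xBE and bnot_code_point(0x100) gives 0xFEFF, while bnot_32 of any valid code point comes out above MAX_CODE_POINT. Note that the width-dependent version is not its own inverse across class boundaries: 0x10000 maps to 0xEFFFF, which maps back to 0x10000, but a value like 0x100 maps to 0xFEFF and stays in the 16-bit class only by luck of the ranges.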
This means that UTF-8 strings will be handled just fine, and (as I
Please don't mix encodings and code points. That strings might be serialized or stored as UTF-8 should have no consequence for bitops.
Exactly. And also realize that if you bitwise-not (or shift or something similar) the bytes of a UTF-8 serialization of something, the result isn't going to be valid UTF-8, so you'd be hard-pressed to lay text semantics down on top of it.
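A quick Python sketch of that point, with helper names invented for the example: inverting the bytes of a UTF-8 serialization of ordinary text typically produces stray continuation bytes. Some inputs do happen to yield a valid byte sequence, but it then decodes to unrelated characters, so text semantics are lost either way.

```python
# Illustrative helpers (hypothetical names): invert the bytes of a UTF-8
# serialization and check whether the result still decodes as UTF-8.

def bnot_bytes(data: bytes) -> bytes:
    """Bitwise-not every byte."""
    return bytes(~b & 0xFF for b in data)


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


original = "hello".encode("utf-8")
inverted = bnot_bytes(original)    # every byte lands in 0x80-0xBF
```

Here is_valid_utf8(original) is true and is_valid_utf8(inverted) is false, because each inverted byte is a continuation byte with no lead byte in front of it. The operation is still reversible at the byte level: bnot_bytes(inverted) gives back the original bytes.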
understand it) some subset of Unicode-at-large will be handled as well.
In other words, the burden goes on the conversion functions, not on the
bit ops.
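As a sketch of the "convert first, then operate" idea, assuming the conversion croaks on anything above \x{ff}: the function names below are hypothetical, Latin-1 encoding stands in for "one byte per code point", and truncating to the shorter operand is an assumption made for the example rather than a settled design.

```python
# Hypothetical sketch: the conversion function carries the burden of
# rejecting wide code points; the bit op only ever sees raw bytes.

def to_raw_bytes(text: str) -> bytes:
    """Map each code point 0x00-0xFF to one byte; raise on anything higher."""
    try:
        return text.encode("latin-1")
    except UnicodeEncodeError as exc:
        raise ValueError("code point above 0xFF in bitwise operand") from exc


def bitwise_and(a: str, b: str) -> bytes:
    """Byte-wise AND of two strings after conversion to raw bytes."""
    x, y = to_raw_bytes(a), to_raw_bytes(b)
    # Assumption for this sketch: the result is as long as the shorter operand.
    return bytes(p & q for p, q in zip(x, y))
```

With this, bitwise_and("AB", "a") returns the single byte 0x41, and bitwise_and("\u0100", "a") raises ValueError before any bit operation runs.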
It's not that it's going to be meaningful in the general case, but if
I'd rather have meaningful results.
Exactly--and, meaningful operations to begin with.
I'm beginning to wonder if we're going to be square-rooting strings, and taking the array-th root of a hash.... :)
JEff