UTF ("Unicode Transformation Format") encodings come in several unit sizes.
UTF-8 uses 8-bit bytes to encode ISO-10646 code points.  When the code point
value is less than 65,536 (i.e. UNICODE) then UTF-8 will use up to 3 bytes
(24 bits) to encode the code point. However, when the code point value is
from the full 31-bit range of ISO-10646, then UTF-8 will use up to 6 bytes
(48 bits) to encode the 31-bit value.
There is also a UTF-16 encoding, which sends 16-bit data units.
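
To make the byte counts concrete, here is a rough sketch of how the length
of a UTF-8 sequence follows from the code point value. The function name
utf8_length is made up for illustration, and the limits assume the original
31-bit UTF-8 scheme described above; it is a sketch, not a reference
implementation.

```python
def utf8_length(code_point: int) -> int:
    """Bytes needed by the original (31-bit) UTF-8 scheme for one code point."""
    if code_point < 0:
        raise ValueError("code point must be non-negative")
    # Exclusive upper bounds for 1, 2, 3, 4, 5 and 6 byte sequences.
    limits = [0x80, 0x800, 0x10000, 0x200000, 0x4000000, 0x80000000]
    for length, limit in enumerate(limits, start=1):
        if code_point < limit:
            return length
    raise ValueError("code point does not fit in 31 bits")


# Cross-check against Python's own UTF-8 codec for a few code points.
for cp in (0x41, 0xE9, 0x20AC, 0x1D11E):
    assert utf8_length(cp) == len(chr(cp).encode("utf-8"))
```

Anything below 0x10000 lands in the 1 to 3 byte rows, which is the 16-bit
case; the 5 and 6 byte rows are only reached by values from the full 31-bit
range.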

Note: Java characters are UNICODE, not ISO-10646. A Java char is 16 bits
wide, so a single char cannot represent a code point value greater than
16 bits.
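
For what it's worth, this is roughly how a code point above 16 bits gets
split into two 16-bit units (a surrogate pair) in UTF-16. The helper name
to_surrogates is hypothetical, and the sketch is in Python only so the
arithmetic can be checked against a real codec.

```python
def to_surrogates(code_point: int) -> tuple:
    """Split a code point above 0xFFFF into a (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points 0x10000..0x10FFFF use surrogates")
    offset = code_point - 0x10000        # 20 bits remain after the subtraction
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


# U+1D11E (a musical symbol) becomes two 16-bit units.
high, low = to_surrogates(0x1D11E)
encoded = chr(0x1D11E).encode("utf-16-be")
assert encoded == high.to_bytes(2, "big") + low.to_bytes(2, "big")
```

Each half fits in 16 bits, which is why such a character takes two 16-bit
units rather than one.
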
BTW, if M$ is encoding Unicode code points as %XXXX when it is sending
UNICODE (16 bits per character) data, that is correct.  HOWEVER, if that
form is used when sending 8-bit encodings such as UTF-8, it is a new M$
feature to lock in their customers.
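
To illustrate the difference between the two forms, here is a small sketch.
percent_encode_utf8 wraps the standard urllib.parse.quote; the helper
percent_encode_16bit is hypothetical and simply writes each 16-bit unit in
the %XXXX notation used in this thread, so it should not be read as the
exact output of any particular browser.

```python
from urllib.parse import quote


def percent_encode_utf8(text: str) -> str:
    """Escape every UTF-8 byte of the text as %XX."""
    return quote(text, safe="")


def percent_encode_16bit(text: str) -> str:
    """Hypothetical helper: write each 16-bit unit as %XXXX (thread notation)."""
    data = text.encode("utf-16-be")
    units = [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]
    return "".join("%{:04X}".format(unit) for unit in units)


euro = "\u20ac"                      # EURO SIGN, code point 0x20AC
print(percent_encode_utf8(euro))     # %E2%82%AC
print(percent_encode_16bit(euro))    # %20AC
```

A decoder that expects %XX pairs would read %20AC as a space followed by
the literal text "AC", so the two forms cannot be mixed safely.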

Tim

----- Original Message -----
From: <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Saturday, May 19, 2001 4:27 PM
Subject: The wonderfull worlds of encodings...


> Hi,
>
> I've got a terrible headache... It happens every time I try to touch the
> bugs related to encodings - any of them...
>
> I'm sure you already know ( but I just found out ) what
> "surrogate" characters are. I know that UTF is _not_ 16 bits, but I had no
> idea it is 21 bits ( as opposed to UCS - 31 bits ).
>
> I'll try to get something working this weekend. Craig - you may want to
> take a look, the code in "DefaultServlet" is creating a writer for each
> encoding ( that's terribly expensive ), and doesn't seem to deal with
> surrogates ( well, the second part is not a problem - I doubt anyone
> would use hieroglyphs or musical signs in a URL ).
>
> Now, the biggest problem is, as usual, M$. For strange reasons, MSIE's
> javascript encode() method is generating %XXXX sequences instead of %XX%XX
> ( as most would expect ). That means the whole decoding might have to be
> rewritten in 3.3 ( Apache doesn't deal with that either ).
>
> Question: what should happen with the context path ? It is supposed to be
> returned in the original form ( not decoded ) - but that can't work as a
> certain path can be encoded in many ways. I'm also not sure what should
> happen in web.xml and in server.xml ( where path is defined ) - should we
> use %xx encoded URLs ? But what would that mean for characters that have
> multiple encodings ?
>
>
> The solution I have in mind right now is to keep doing all the mappings
> and process web.xml - and do all internal operations with decoded
> characters, while keeping the "original" form for the facade, so servlets
> get what they expect.
>
> Any ideas ? I'm not sure I can handle this.
>
>
> Costin
>
>
