/el-niño.jsp should be sent (per the w3c recommendation) as /el-ni%c3%b1o.jsp: every character outside the ~60 characters allowed from ASCII is converted to its UTF-8 byte sequence, and each of those bytes is %-escaped. The encoding used for the byte conversion is not specified in the official URI spec (RFC 2396), but in December the w3c recommended that UTF-8 be used by all. IE and Mozilla already appear to encode requests this way. The server is technically supposed to attempt to read the bytes as UTF-8 and fall back to decoding with the platform default.
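To make the byte-level difference concrete, here is a minimal sketch of the escaping and of the "try UTF-8, then fall back" decoding described above. The class and method names (UriEscapeSketch, escape, unescape) are made up for illustration, the safe-character set is a simplified assumption rather than the full RFC 2396 list, and it uses java.nio.charset.StandardCharsets from a current JDK rather than anything available in Java 1.4. It is not what Tomcat does.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class UriEscapeSketch {
    // Assumption: a simplified safe set (letters, digits and these marks).
    private static final String SAFE_MARKS = "-_.~/";

    // Convert the path to bytes in the given charset and %-escape every unsafe byte.
    static String escape(String path, Charset charset) {
        StringBuilder out = new StringBuilder();
        for (byte b : path.getBytes(charset)) {
            int v = b & 0xff;
            boolean safe = (v >= 'a' && v <= 'z') || (v >= 'A' && v <= 'Z')
                    || (v >= '0' && v <= '9') || SAFE_MARKS.indexOf(v) >= 0;
            if (safe) {
                out.append((char) v);
            } else {
                out.append('%').append(String.format("%02X", v));
            }
        }
        return out.toString();
    }

    // Undo the %-escapes, then read the bytes as strict UTF-8; if they are not
    // valid UTF-8, decode them with the fallback charset instead.
    // Assumes a 7-bit input string with well-formed %XX escapes.
    static String unescape(String escaped, Charset fallback) {
        ByteArrayOutputStream raw = new ByteArrayOutputStream();
        for (int i = 0; i < escaped.length(); i++) {
            char c = escaped.charAt(i);
            if (c == '%' && i + 2 < escaped.length()) {
                raw.write(Integer.parseInt(escaped.substring(i + 1, i + 3), 16));
                i += 2;
            } else {
                raw.write(c);
            }
        }
        byte[] bytes = raw.toByteArray();
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            return new String(bytes, fallback);
        }
    }

    public static void main(String[] args) {
        String path = "/el-ni\u00f1o.jsp"; // \u00f1 is n-tilde
        System.out.println(escape(path, StandardCharsets.UTF_8));      // /el-ni%C3%B1o.jsp
        System.out.println(escape(path, StandardCharsets.ISO_8859_1)); // /el-ni%F1o.jsp
        // Both escaped forms decode back to the same path.
        System.out.println(path.equals(unescape("/el-ni%C3%B1o.jsp", StandardCharsets.ISO_8859_1)));
        System.out.println(path.equals(unescape("/el-ni%F1o.jsp", StandardCharsets.ISO_8859_1)));
    }
}
```

The fallback is passed in explicitly (ISO-8859-1 here) only so the result does not depend on the machine; a server following the description above would presumably pass its platform default.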
For the record, /el-niño.jsp is /el-ni%f1o.jsp if the bytes are encoded via iso-latin-1. Any character above 0x7f isn't safe and will be encoded as 2-4 bytes under UTF-8. Certain byte sequences are also reserved.

I've spent a long time on this trying to create truly internationalized code. If you look at the Java 1.4 Release Candidate you will see that URLEncoder and URLDecoder now recognize that this is the correct behaviour: the methods that don't take an encoding are deprecated. I think they should default to UTF-8, but the default is the platform default.

The w3c has a good section on this at http://www.w3.org/International/O-URL-and-ident.html
They also have Java source for encoding/decoding URIs at http://www.w3.org/International/O-URL-code.html

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]
Sent: Monday, February 04, 2002 8:57 AM
To: Tomcat Developers List
Subject: Re: cvs commit: jakarta-tomcat RELEASE-PLAN-3.3.1.txt

On Mon, 4 Feb 2002, Bill Barker wrote:

> My understanding of this is that if the request is for:
> /el-niño.jsp
> then most of the time Tomcat will read it correctly. But it will
> return for
> requestURI:
> /el-ni%A1o.jsp
> The "safe chars" map to the same code points under iso-latin-1 and utf-8
> (that's why they are "safe chars"). UEncoder is strict in what is safe, but
> the RFC isn't. You are allowed to use extended chars if the other side is
> capable of detecting the charset.

I wouldn't change this behavior - I think it's better to return the second form rather than the first. The URL is supposed to be 7-bit safe. It is something you can write on paper or type on any keyboard.

%A1 is not the same under 8859_1 and utf8 ( AFAIK - I may be wrong ). And "/el-niño.jsp" is hard to type on a keyboard or to view for people with non-8859_1 charsets. ( %A1 will have a very different char ).
IMHO the RFC is clear enough about what a 'safe char' is, and my understanding was that anything >0x7f isn't. ( The 'encoded' URI is something you are supposed to print, go to a different computer, type, and get to the page. You can't type ñ on a Chinese or Greek keyboard. )

Costin

--
To unsubscribe, e-mail: <mailto:[EMAIL PROTECTED]>
For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>