-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

André,

On 5/18/2010 4:18 AM, André Warnier wrote:
> Among other bizarre things, it does mean that in a URL, one has to
> encode the hostname using one method, and the rest of the URL using
> another method.  As if encoding issues were not already complicated enough.
> 
> This, and a lot of other non-USASCII encoding issues all through the
> web, point to the real need to move to a fully Unicode/UTF-8 based web
> infrastructure.  It rather puzzles me why this does not seem to be a
> major topic of discussion in forums such as this one.

+1

Unfortunately, you can't guarantee how the browser will interpret your
URL, no matter how you encode it. If you do a simple UTF-8 dump of the
URL (that is, no %-encoding, no nothing), you run the risk of using
"illegal" characters as far as the browser is concerned.

Technically speaking, sending the raw character is not allowed at all:
MIME headers must be in US-ASCII, so the ñ is not legal there and must
be encoded. The next question is how to encode it. The Unicode code
point for ñ is U+00F1, so a simple encoding might be %F1, but the
browser is then free to decide whether that byte means ñ (the Unicode
code point taken directly) or something else (not sure what...
ISO-8859-1 0xF1 is also ñ).

If you encode it in UTF-8, you ought to get the two bytes 0xC3 0xB1,
which should be encoded into the URL as %c3%b1. Both of these encodings
(ISO-8859-1 and UTF-8) were observed by the OP under various
circumstances.
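
To make the two encodings concrete, here is a minimal sketch in Java. It
assumes Java 10 or later (for the Charset overloads of URLEncoder and
URLDecoder); the class and method names are made up for illustration.
Note that URLEncoder implements form encoding rather than URL path
encoding, so treat it as a stand-in; for a lone ñ the result is the same.
It emits uppercase hex digits, which are equivalent to the lowercase
ones used above.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class PercentEncodingDemo {

    // Percent-encode a string after converting it to bytes in the given charset.
    static String encode(String s, Charset cs) {
        return URLEncoder.encode(s, cs);
    }

    // Percent-decode a string, interpreting the decoded bytes in the given charset.
    static String decode(String s, Charset cs) {
        return URLDecoder.decode(s, cs);
    }

    public static void main(String[] args) {
        String enye = "\u00f1"; // the character ñ, U+00F1

        // Same character, two different byte sequences on the wire.
        System.out.println(encode(enye, StandardCharsets.ISO_8859_1)); // %F1
        System.out.println(encode(enye, StandardCharsets.UTF_8));      // %C3%B1

        // If the receiver guesses the wrong charset, the ñ is lost.
        String wrong1 = decode("%C3%B1", StandardCharsets.ISO_8859_1);
        System.out.println(wrong1.length());      // 2 (U+00C3 followed by U+00B1)
        String wrong2 = decode("%F1", StandardCharsets.UTF_8);
        System.out.println(wrong2.equals(enye));  // false
    }
}
```

The last two lines are the whole problem in miniature: the bytes arrive
intact, but the receiver has to guess which charset the sender used.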

The problem is that the rules and practice for browser behavior are ...
unclear. Even in the presence of such rules, every browser I've seen has
a setting for "use UTF-8 for URL encoding" and the default settings are,
I'm sure, inconsistent between browsers and even versions of the same
browser. Since the user can always choose to override what the server
expects, it's probably best to restrict your URLs to US-ASCII. For this
reason, we have stopped using GET for any requests that could reasonably
be expected to contain non-US-ASCII data (such as FORM submissions).
Restricting URLs this way may mean that you have to "misspell" certain
words (such as niño) in file names and paths.
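
If you want to enforce the US-ASCII restriction rather than rely on
convention, a check along these lines might do. This is only a sketch:
the class and method names are hypothetical, and where you would call it
(a build step, a link generator, a filter) is up to you.

```java
import java.nio.charset.StandardCharsets;

public class AsciiUrlCheck {

    // Returns true only if every character of the URL can be
    // represented in US-ASCII, i.e. nothing is left to the browser's
    // choice of encoding.
    static boolean isUsAscii(String url) {
        // A fresh encoder per call, since CharsetEncoder is not thread-safe.
        return StandardCharsets.US_ASCII.newEncoder().canEncode(url);
    }

    public static void main(String[] args) {
        System.out.println(isUsAscii("http://www.example.com/nino.html"));      // true
        System.out.println(isUsAscii("http://www.example.com/ni\u00f1o.html")); // false
    }
}
```

A %-encoded URL passes this check, since the escapes themselves are
plain ASCII; only a raw non-ASCII character fails it.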

For my money, this URL should be the one to use (at least when ignoring
the "standard" for internationalized domain names mentioned elsewhere in
this thread): http://www.coru%c3%b1a.es.

Good luck,
- -chris
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.10 (MingW32)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAkv1obkACgkQ9CaO5/Lv0PBpBwCfamhsEyA+4Zf/srgt+BUrTu00
mfQAoL98xsYx470lIPljlqM2qbpJmpDB
=2PlI
-----END PGP SIGNATURE-----
