-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Mark,

Mark Thomas wrote:
> Christopher Schultz wrote:
>> Unfortunately, I can't find anywhere in the spec that says which
>> encoding to use in URLs for % HEX HEX encoding.
> 
> AFAIR there is a W3C recommendation that it should be UTF-8 but it isn't
> mandated by any spec.

I'd be interested in reading that recommendation. There is a lot of
conflicting information out there:

http://www.w3.org/Addressing/URL/uri-spec.html
Section: "Conventional URI encoding scheme"
"
Where the local naming scheme uses ASCII characters which are not
allowed in the URI, these may be represented in the URL by a percent
sign "%" immediately followed by two hexadecimal digits (0-9, A-F)
giving the ISO Latin 1 code for that character. Character codes other
than those allowed by the syntax shall not be used unencoded in a URI.
"

A second document, which the first one references, contradicts it:

http://www.w3.org/Addressing/URL/url-spec.txt
Section "ENCODING PROHIBITED CHARACTERS"
"
This specification makes no assumptions or requirements about the
character sets, if any, referred to be the (decoded) octets a URL.
"

Then, there's the official RFC 3986 (URI Generic Syntax):
http://gbiv.com/protocols/uri/rfc/rfc3986.html#characters
"
The URI syntax provides a method of encoding data, presumably for the
sake of identifying a resource, as a sequence of characters. The URI
characters are, in turn, frequently encoded as octets for transport or
presentation. This specification does not mandate any particular
character encoding for mapping between URI characters and the octets
used to store or transmit those characters. When a URI appears in a
protocol element, the character encoding is defined by that protocol;
without such a definition, a URI is assumed to be in the same character
encoding as the surrounding text.
"
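
To make the consequence concrete, here is a minimal sketch of what happens
when the receiving side has to guess the charset. It assumes plain
java.net.URLDecoder (which implements form decoding, not the exact RFC 3986
rules); the class and method names are made up for illustration and are not
what Tomcat does internally.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical demo class; not part of Tomcat or of any spec.
class DecodeAmbiguity {
    // Decodes a percent-encoded string, interpreting the octets in the given charset.
    static String decode(String encoded, String charset) {
        try {
            return URLDecoder.decode(encoded, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charset, e);
        }
    }

    public static void main(String[] args) {
        String encoded = "%C3%80"; // two octets: 0xC3 0x80

        // As UTF-8, the two octets form one character: U+00C0 (A WITH GRAVE).
        String asUtf8 = decode(encoded, "UTF-8");

        // As ISO-8859-1, each octet is its own character: U+00C3 then U+0080.
        String asLatin1 = decode(encoded, "ISO-8859-1");

        System.out.println("UTF-8:      " + asUtf8.length() + " char(s)");   // 1 char(s)
        System.out.println("ISO-8859-1: " + asLatin1.length() + " char(s)"); // 2 char(s)
    }
}
```

The same octets are valid under both charsets, so nothing in the URL itself
tells the decoder which reading was intended.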

In the introduction of that same RFC, we find this:

"
Percent-encoded octets (Section 2.1) may be used within a URI to
represent characters outside the range of the US-ASCII coded character
set if this representation is allowed by the scheme or by the protocol
element in which the URI is referenced. Such a definition should specify
the character encoding used to map those characters to octets prior to
being percent-encoded for the URI.
"

But wait, there's more (Section 2.5: Identifying Data):

"When a new URI scheme defines a component that represents textual data
consisting of characters from the Universal Character Set [UCS], the
data should first be encoded as octets according to the UTF-8 character
encoding [STD63]; then only those octets that do not correspond to
characters in the unreserved set should be percent-encoded. For example,
the character A would be represented as "A", the character LATIN CAPITAL
LETTER A WITH GRAVE would be represented as "%C3%80", and the character
KATAKANA LETTER A would be represented as "%E3%82%A2"."
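
The RFC's examples can be reproduced with a short sketch, and the same
character can be run through ISO Latin 1 for comparison with the older W3C
text. This assumes java.net.URLEncoder, which does form encoding (a space
becomes "+"), so it only approximates RFC 3986 percent-encoding; for these
particular characters the output is the same. The class and method names are
hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical demo class; not part of Tomcat or of any spec.
class PercentEncodingDemo {
    // Maps the text to octets in the given charset, then percent-encodes them.
    static String encode(String text, String charset) {
        try {
            return URLEncoder.encode(text, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charset, e);
        }
    }

    public static void main(String[] args) {
        String aGrave = "\u00C0";    // LATIN CAPITAL LETTER A WITH GRAVE
        String katakanaA = "\u30A2"; // KATAKANA LETTER A

        System.out.println(encode("A", "UTF-8"));          // A
        System.out.println(encode(aGrave, "UTF-8"));       // %C3%80
        System.out.println(encode(katakanaA, "UTF-8"));    // %E3%82%A2

        // The same character under the "ISO Latin 1" convention is one octet.
        System.out.println(encode(aGrave, "ISO-8859-1"));  // %C0
    }
}
```

So one character yields "%C3%80" under one convention and "%C0" under the
other, and both are syntactically valid URLs.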


So, there you have it: there is no official charset to use; the scheme or
protocol ought to define one; and for new schemes the recommended charset
is UTF-8. WTF?!

- -chris

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (MingW32)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFG+SBM9CaO5/Lv0PARAj0cAJ90eyq430RKd1B1ftQzTjPznxbYCQCfauhM
zdLEKTznnv29c7t6N6p3+R4=
=mopd
-----END PGP SIGNATURE-----

---------------------------------------------------------------------
To start a new topic, e-mail: users@tomcat.apache.org
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
