Mark,

Mark Thomas wrote:
> Christopher Schultz wrote:
>> Unfortunately, I can't find anywhere in the spec that says which
>> encoding to use in URLs for % HEX HEX encoding.
>
> AFAIR there is a W3C recommendation that it should be UTF-8 but it isn't
> mandated by any spec.

I'd be interested in reading that recommendation. There is a lot of
conflicting information out there:

http://www.w3.org/Addressing/URL/uri-spec.html
Section: "Conventional URI encoding scheme"

"
Where the local naming scheme uses ASCII characters which are not
allowed in the URI, these may be represented in the URL by a percent
sign "%" immediately followed by two hexadecimal digits (0-9, A-F)
giving the ISO Latin 1 code for that character. Character codes other
than those allowed by the syntax shall not be used unencoded in a URI.
"

This other document (referred to by the first one) conflicts with the
former:

http://www.w3.org/Addressing/URL/url-spec.txt
Section "ENCODING PROHIBITED CHARACTERS"

"
This specification makes no assumptions or requirements about the
character sets, if any, referred to be the (decoded) octets a URL.
"

Then, there's the official RFC 3986 (URI Generic Syntax):

http://gbiv.com/protocols/uri/rfc/rfc3986.html#characters

"
The URI syntax provides a method of encoding data, presumably for the
sake of identifying a resource, as a sequence of characters. The URI
characters are, in turn, frequently encoded as octets for transport or
presentation. This specification does not mandate any particular
character encoding for mapping between URI characters and the octets
used to store or transmit those characters. When a URI appears in a
protocol element, the character encoding is defined by that protocol;
without such a definition, a URI is assumed to be in the same character
encoding as the surrounding text.
"

In the introduction of the previous document, we find this:

"
Percent-encoded octets (Section 2.1) may be used within a URI to
represent characters outside the range of the US-ASCII coded character
set if this representation is allowed by the scheme or by the protocol
element in which the URI is referenced. Such a definition should
specify the character encoding used to map those characters to octets
prior to being percent-encoded for the URI.
"

But wait, there's more (Section 2.5: Identifying Data)

"When a new URI scheme defines a component that represents textual data
consisting of characters from the Universal Character Set [UCS], the
data should first be encoded as octets according to the UTF-8 character
encoding [STD63]; then only those octets that do not correspond to
characters in the unreserved set should be percent-encoded. For example,
the character A would be represented as "A", the character LATIN CAPITAL
LETTER A WITH GRAVE would be represented as "%C3%80", and the character
KATAKANA LETTER A would be represented as "%E3%82%A2"."

So, there you have it: there is no official charset to use; an official
charset ought to be defined; the officially defined charset is UTF-8.

WTF?!

-chris