2014-02-03 André Warnier <a...@ice-sa.com>:
> André Warnier wrote:
>>
>> Chris,
>>
>> a note :
>>
>> Christopher Schultz wrote:
>> ...
>>
>>>
>>> Without quoting, unquoted Cookie names and values may be any US-ASCII
>>> character from 0x20 - 0x7e except for any of ("(" | ")" | "<" | ">" |
>>> "@" | "," | ";" | ":" | "\" | <"> | "/" | "[" | "]" | "?" | "=" | "{"
>>> | "}" | SP | HT). None of the characters above are within that range,
>>> so the cookie value must be quoted. (It looks to me like Cookie names
>>> must always be in US-ASCII... I didn't think that was the case but I'm
>>> not motivated to track-down every word of the spec looking for
>>> justification).
>>>
>>> What is the character encoding of the request? What client are you
>>> using? Who created the cookie in the first place?
>>>
>>
>> I did the tracking down of the (tortuous) specs, and come to this :
>>
>> 1) the ISO-8859-1 character set includes "é" (Catalan and other
>> languages) (*)
>>
>> 2) the US-ASCII character set is a subset of ISO-8859-1, and does not
>> include "é".
>>
>> 3) The default character set for HTTP 1.1 is ISO-8859-1, as stated
>> explicitly and implicitly in various places in RFC 2616 [1].
>>
>> However, RFC 2616 does not define the "Cookie" nor "Set-Cookie" headers,
>> and it also does not specifically indicate which character set should be
>> used for HTTP Request/Response header values. It refers for that to
>> RFC 822 (the Internet text message format), which talks only about
>> US-ASCII.
>>
>> 4) The "Cookie" and "Set-Cookie" headers seem to be subsequently and
>> lastly defined in RFC 6265 [2].
>> In section 4.1.1 [3], the syntax of these headers is defined, as :
>>
>> cookie-pair       = cookie-name "=" cookie-value
>> cookie-name       = token
>> cookie-value      = *cookie-octet / ( DQUOTE *cookie-octet DQUOTE )
>> cookie-octet      = %x21 / %x23-2B / %x2D-3A / %x3C-5B / %x5D-7E
>>                       ; US-ASCII characters excluding CTLs,
>>                       ; whitespace DQUOTE, comma, semicolon,
>>                       ; and backslash
>> token             = <token, defined in [RFC2616], Section 2.2>
>>
>> Thus, it seems that you are right, and that a cookie *value* can
>> (regrettably still) only consist of US-ASCII characters (not including
>> "é" thus).
>>
>> (I cannot find in the specs a way to quote a non-US-ASCII character
>> either; that's apparently only allowed in parts defined as "comments")
>>
>> (It is stated somewhere else in RFC 6265 that it is recommended to encode
>> the Cookie value via e.g. Base64, if it were to potentially contain non
>> US-ASCII characters).
>>
>> The cookie name is a "token", and the definition of "token" sends us back
>> to RFC2616.
>> In "2.2 Basic Rules", RFC2616 states :
>>
>> token          = 1*<any CHAR except CTLs or separators>
>> separators     = "(" | ")" | "<" | ">" | "@"
>>                | "," | ";" | ":" | "\" | <">
>>                | "/" | "[" | "]" | "?" | "="
>>                | "{" | "}" | SP | HT
>> ...
>> CHAR           = <any US-ASCII character (octets 0 - 127)>
>> CTL            = <any US-ASCII control character
>>                  (octets 0 - 31) and DEL (127)>
>>
>> So, this all would tend to show that you are right, and that Cookie names
>> (as well as values) can only consist of US-ASCII characters, and that "é"
>> is thus not allowed (without some form of encoding that would represent
>> it as a sequence of US-ASCII characters).
>>
>> Which, in my personal opinion is a lasting p-i-t-a and shame. And I
>> cannot imagine how it can be nowadays that nobody has yet gotten around
>> to proposing a HTTP 2.0 RFC where the default character set would be
>> Unicode, UTF-8 encoded, for everything excluding maybe header names. But
>> that's neither here nor there.
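
As an inline illustration of the two grammars quoted above, here is a minimal Java sketch of the checks they imply. The class and method names (CookieGrammar, isToken, isCookieValue) are made up for this example and are not Tomcat's parser; the code only mirrors the "token" rule from RFC 2616 section 2.2 and the "cookie-octet" rule from RFC 6265 section 4.1.1 as they are quoted in this thread.

```java
// Hypothetical helper, not part of Tomcat or the Servlet API.
final class CookieGrammar {

    // Separators as listed in RFC 2616 section 2.2 (SP and HT included).
    private static final String SEPARATORS = "()<>@,;:\\\"/[]?={} \t";

    // token = 1*<any CHAR except CTLs or separators>
    static boolean isToken(String s) {
        if (s.isEmpty()) {
            return false; // "1*" requires at least one character
        }
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            // CTLs are 0-31 and DEL (127); anything above 127 is not US-ASCII.
            if (c <= 31 || c >= 127 || SEPARATORS.indexOf(c) >= 0) {
                return false;
            }
        }
        return true;
    }

    // cookie-octet = %x21 / %x23-2B / %x2D-3A / %x3C-5B / %x5D-7E
    static boolean isCookieOctet(char c) {
        return c == 0x21
                || (c >= 0x23 && c <= 0x2B)
                || (c >= 0x2D && c <= 0x3A)
                || (c >= 0x3C && c <= 0x5B)
                || (c >= 0x5D && c <= 0x7E);
    }

    // cookie-value = *cookie-octet / ( DQUOTE *cookie-octet DQUOTE )
    static boolean isCookieValue(String s) {
        int start = 0;
        int end = s.length();
        // Strip one pair of surrounding double quotes, if present.
        if (end >= 2 && s.charAt(0) == '"' && s.charAt(end - 1) == '"') {
            start = 1;
            end--;
        }
        for (int i = start; i < end; i++) {
            if (!isCookieOctet(s.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isToken("ACE_COOKIE"));        // true
        System.out.println(isCookieValue("R660302447"));  // true
        System.out.println(isCookieValue("abc\u00e9"));   // false: "é" is not a cookie-octet
    }
}
```

Under these rules a value containing "é" fails the check whether or not it is wrapped in double quotes, which matches the conclusion drawn above.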
>>
>> To get back to the OP's question thus, it seems to me that
>> - Tomcat 7.x would not be in violation of the specs, if it indeed rejects
>> a Cookie header containing any non-US-ASCII character (whether in the
>> cookie name or value).
>> - That the error message could be improved ("é" is not a control
>> character, it's just invalid here)
>> - but that the real fix for the OP may be to Base64-encode the cookie
>> value before sending it to the browser.
>> That's also because it may happen one day that even a browser which
>> respects the specs to the letter (one never knows), could reject a value
>> like : "abcé","abc","abc","abc","abc","abc","abc","abc","abc";
>>
>>
>> [1] http://tools.ietf.org/search/rfc2616
>> [2] http://tools.ietf.org/search/rfc6265
>> [3] http://tools.ietf.org/search/rfc6265#section-4.1.1
>>
>>
>
> As an appendix, and triggered by another post to this list, here is
> another way of encoding HTTP header values :
>
> Cookie: ACE_COOKIE=R660302447; TD3World=R760446058
> SM_TRANSACTIONID:
> =?UTF-8?B?MGE2NDA2MDEtNDAzMy01MjdjYzlkMy0wMDBhLTJjMWI0NjJi?=
> SM_AUTHTYPE: =?UTF-8?B?QXV0bw==?=
> SM_SDOMAIN: =?UTF-8?B?LnRveW90YS1ldXJvcGUuY29t?=
>
> In this case, the SM_* header values are encoded using a "MIME
> extension" scheme which indicates (between =? ? ?) prior to a string's
> value, the character set/encoding in which the subsequent string is to
> be interpreted.
> This is not explicitly mentioned in any of the above references, but as I
> recall, this is part of another series of RFC's, maybe starting at this
> one :
> http://tools.ietf.org/html/rfc2184
> Now how this works out (also browser-side) with Cookie headers composed of
> cookie names and values, I couldn't say.
>
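
The Base64 suggestion quoted above could look roughly like the following sketch. It assumes Java 8 or later (for java.util.Base64) and UTF-8 as the byte encoding; the class name CookieValueCodec is made up for this example. The URL-safe alphabet without padding is my own choice here, not something the RFCs require; it keeps the result to letters, digits, "-" and "_", all of which are valid cookie-octets.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper, not part of Tomcat or the Servlet API.
final class CookieValueCodec {

    // Turn arbitrary text into a US-ASCII-only cookie value.
    static String encode(String raw) {
        byte[] utf8 = raw.getBytes(StandardCharsets.UTF_8);
        return Base64.getUrlEncoder().withoutPadding().encodeToString(utf8);
    }

    // Reverse of encode(); throws IllegalArgumentException on invalid Base64.
    static String decode(String cookieValue) {
        byte[] utf8 = Base64.getUrlDecoder().decode(cookieValue);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String encoded = encode("abc\u00e9");
        System.out.println(encoded);          // YWJjw6k
        System.out.println(decode(encoded));  // abcé (if the console supports it)
    }
}
```

The application would call encode() before creating the cookie and decode() after reading it back, so that only US-ASCII ever appears in the Set-Cookie and Cookie headers.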
RFC 2616 also says the following on page 16:

   The TEXT rule is only used for descriptive field contents and values
   that are not intended to be interpreted by the message parser. Words
   of *TEXT MAY contain characters from character sets other than
   ISO-8859-1 [22] only when encoded according to the rules of
   RFC 2047 [14].

       TEXT           = <any OCTET except CTLs,
                        but including LWS>

RFC 2047 is also referenced in the Javadoc for
HttpServletResponse.setHeader().

The "B" encoding used in the example above is one of the encodings
allowed by RFC 2047, ch. 4.1.

http://www.ietf.org/rfc/rfc2047.txt

Best regards,
Konstantin Kolinko
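
For illustration, the "B" encoded-words from the SM_* headers quoted above can be decoded with a few lines of Java. This is only a sketch under some assumptions: Java 8 or later, a made-up class name (EncodedWord), a single encoded-word per value, and no support for the "Q" encoding or the other details of RFC 2047.

```java
import java.nio.charset.Charset;
import java.util.Base64;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tomcat or the Servlet API.
final class EncodedWord {

    // =?charset?B?base64-text?=  (only the "B" encoding is handled here)
    private static final Pattern B_WORD =
            Pattern.compile("=\\?([^?\\s]+)\\?[Bb]\\?([A-Za-z0-9+/=]*)\\?=");

    // Decodes a single "B" encoded-word; any other input is returned unchanged.
    static String decodeB(String value) {
        Matcher m = B_WORD.matcher(value.trim());
        if (!m.matches()) {
            return value;
        }
        Charset charset = Charset.forName(m.group(1)); // throws if the charset is unknown
        byte[] bytes = Base64.getDecoder().decode(m.group(2));
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        System.out.println(decodeB("=?UTF-8?B?QXV0bw==?="));                      // Auto
        System.out.println(decodeB("=?UTF-8?B?LnRveW90YS1ldXJvcGUuY29t?="));      // .toyota-europe.com
    }
}
```

Whether a browser or a servlet container would apply such decoding to Cookie headers is a separate question that this sketch does not answer.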