Mirko Solic wrote:
Christopher, thanks for the quick reply.
...
I'm from Slovenia, Europe. We use characters that are not defined
in ASCII, so we use the UTF-8 encoding.
I will try to explain what this application is about.
This project (a web page) is protected with AAI
(http://www.switch.ch/aai/about/). This Authentication and
Authorization Infrastructure is roughly divided into an SP (service
provider) and an IdP (identity provider). The SP is a module in Apache.
So when a user tries to get a web page that is protected with AAI
through Apache, the SP module checks whether the user is already logged
in. If not, the SP redirects the user to the IdP, where the user can
enter his/her username and password. If everything is OK, the IdP sends
the user's data as XML to the SP. The SP puts this data into Apache
environment variables so that applications (web pages) can access it.
Here I use mod_jk to pass these environment variables to Tomcat as HTTP
headers. If I print the user data on the Apache side, I get the values
in UTF-8 encoding, but if I try this on the Tomcat side I don't get the
right values; I have to do a conversion.
Is mod_jk responsible for converting the UTF-8 environment variables to
ASCII header values, or is this conversion made elsewhere?
Mirko,
I am from Belgium, so from Europe too. I live in Spain and work mostly
for German and other international customers (among which are some from
Poland too). This is to say that I am well aware of multilingual
character set issues, and confront them every day.
So, just to give you some "context" for your issues:
Despite the fact that Unicode and UTF-8 are now being increasingly used
on the web, the fact is that HTTP, and HTML, and most of the other
WWW-relevant RFCs, are still US-ASCII and ISO-8859-1 (latin-1) based.
For example, HTTP header values are /supposed/ to contain only
single-byte character codes that are part of the (printable subset of)
US-ASCII character set.
Another example: by default, text "content" exchanged between browsers
and web servers is taken to be iso-8859-1 whenever no other charset is
explicitly declared.
And it is so because the relevant RFCs say that it should be.
(So the developers of Apache and mod_jk and Tomcat have little choice in
the matter; they have to follow the RFCs).
This does not mean that you cannot handle other character sets on the
web. But it means that whenever you do, you have to be attentive to the
fact that it is /not/ the standard, and that you may have to do
character set translations and/or encoding.
It may even mean that, in order to exchange non-US-ASCII or
non-ISO-8859-1 data, you may have to use "tricks".
It also means that, in some cases, by using such "tricks", your
applications may become "non-standard", and will not necessarily work
with all servers and all clients.
So, to get back to your question above: mod_jk is not responsible for
translating anything, and will not translate anything.
That is because mod_jk follows the relevant WWW RFCs, which specify that
such and such data is ASCII or ISO-8859-1.
If the original HTTP request, as it is given by Apache to mod_jk,
contains HTTP headers, mod_jk will forward these headers "as is" to the
back-end Tomcat. But, because the HTTP RFC specifies that HTTP headers
should contain only US-ASCII character data, mod_jk would be allowed, if
it finds non-US-ASCII data in an HTTP header, to strip this data or
ignore the header or something like that. I don't know if mod_jk
actually does this, but if it did, it would be justified, because
according to the HTTP RFC this would be an invalid header.
So, to be practical :
- the current HTTP 1.1 RFC specifies that HTTP headers can only contain
US-ASCII printable character data
- some UTF-8 codes contain bytes that are not part of the US-ASCII
character set (e.g. : bytes with values above 0x7F)
- so, if you want to forward such a header from Apache to Tomcat, in
principle the "right" way is to "encode" the value of this header on the
Apache side, in such a way that it contains only US-ASCII data (for
example, using Base64 encoding), then pass it to mod_jk.
- at the other end, your application would have to decode this header
(using Base64 decoding) back into UTF-8, and then it would have to read
this header value as UTF-8/Unicode.
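
The encode/decode round trip described in the steps above could be
sketched on the Tomcat (Java) side roughly as follows. This is only a
sketch, assuming Java 8 or later (for java.util.Base64). The class
HeaderCodec and its method names are made up for illustration, and the
Apache side would need an equivalent Base64 encoder of its own, which is
not shown here.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper; the class and method names are illustrative only.
final class HeaderCodec {
    private HeaderCodec() {}

    /** Encodes text as UTF-8 bytes, then as Base64, so the result is pure US-ASCII. */
    static String encode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return Base64.getEncoder().encodeToString(utf8);
    }

    /** Reverses encode(): Base64 text back to bytes, then bytes read as UTF-8. */
    static String decode(String headerValue) {
        byte[] utf8 = Base64.getDecoder().decode(headerValue);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    /** True if every character is in the printable US-ASCII range 0x20..0x7E. */
    static boolean isPrintableAscii(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x20 || c > 0x7E) {
                return false;
            }
        }
        return true;
    }
}
```

The application would call HeaderCodec.decode() on the value it reads
from the request header. The isPrintableAscii() check is only there to
show that the encoded form stays within what the RFC allows in a header.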
There is no guarantee that any standard module or class under Apache or
mod_jk or Tomcat would properly handle a header that contains
non-US-ASCII data. That is because, in principle, they never have to.
I know it is a mess. It is possible that there are shortcuts. It is
possible that mod_jk would transmit an HTTP header even if it contains
non-US-ASCII data. But that is not guaranteed, because "the bible" for
mod_jk, as for Apache and as for Tomcat, is the set of RFCs.
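
To illustrate what such a "shortcut" can look like: if the UTF-8 bytes
do get through unchanged, but the receiving side decodes them as
ISO-8859-1, the text comes out garbled, and it can be recovered by
re-reading the same bytes as UTF-8. Below is a minimal Java sketch
assuming exactly that situation. Whether your Apache/mod_jk/Tomcat
combination really behaves this way is something to verify, not
something I can promise. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; it only simulates the suspected mis-decoding.
final class HeaderCharsetDemo {
    private HeaderCharsetDemo() {}

    /** Simulates a receiver that decodes raw header bytes as ISO-8859-1. */
    static String readAsLatin1(byte[] rawHeaderBytes) {
        return new String(rawHeaderBytes, StandardCharsets.ISO_8859_1);
    }

    /**
     * Recovers the text, ASSUMING the original bytes were really UTF-8.
     * ISO-8859-1 maps every byte to exactly one character, so getting
     * the bytes back from the mis-decoded string loses nothing.
     */
    static String reinterpretAsUtf8(String misdecoded) {
        byte[] raw = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }
}
```

This is presumably the kind of "conversion" you are doing already. It
works only as long as every component in the chain passes the bytes
through untouched, which is why the encoded (Base64) route is the safer
one.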
We non-English speakers worldwide desperately need a new version of the
HTTP protocol where the default would be Unicode/UTF-8, for everything.
But I do not see much happening right now in that direction.
Maybe a tip for your authentication issues :
If, in the AJP <Connector> on the Tomcat side, you set the attribute
tomcatAuthentication="false"
then Tomcat will accept the user-id authenticated by Apache, as the
user-id for Tomcat (mod_jk transmits it).
So if your user authentication mechanism works fine at the Apache level,
and generates a user-id that is "acceptable" to Tomcat, this may be a
solution for your issue.
I have no idea if this user-id, for Tomcat, can or cannot contain
non-US-ASCII characters.