albrecht andrzejewski wrote:
I ran across the ML archives and found some useful posts.

I've almost solved my problem: I can now display accented characters (é è à) by calling
request.setCharacterEncoding("UTF-8");
response.setCharacterEncoding("UTF-8");

It seems that the default charset for Tomcat is ISO-8859-1.
The J2EE javadoc says:

"If no charset is specified, ISO-8859-1 will be used."

I was pretty sure that Tomcat handled UTF-8 by default, but that's not the case, at least for HttpServletResponse objects. Anyway, do you know if it's possible to set a default charset for all Tomcat responses, instead of calling these two methods every time a request reaches the servlet?

I tried setting CATALINA_OPTS, but perhaps the file encoding is separate from the request/response encoding:
CATALINA_OPTS="-Dfile.encoding=UTF-8"
export LC_ALL CATALINA_OPTS


Take the following with caution, because I do not really know the underlying mechanism in Tomcat:

I have found that setting the LC_CTYPE environment variable to a UTF-8 "locale" (or conversely, to an ISO-8859-1 locale) prior to starting Tomcat influences the way in which *some* servlets read request bodies and/or encode responses. You can do this in the startup.sh script, or probably more correctly in the setenv.sh script, in the Tomcat/bin directory (that is, if your Tomcat is "the" canonical distribution; if your Tomcat comes from a pre-packaged version, it may not use these scripts for startup).
Make sure to use a valid, installed locale. Run
locale -a
then pick from the list an installed locale that fits and has "utf8" in its name, and add it to the script, for example:
LC_CTYPE="en_US.utf8"; export LC_CTYPE
prior to starting Tomcat.

(In the above, I am assuming Unix/Linux; under Windows it may not be feasible.)

One reason to be careful with this is that it may have unexpected consequences for other servlets. I believe this happens when a servlet does not explicitly specify the encoding it uses for reading the request body or writing the response, and the JVM then defaults to the locale setting of the process that runs it and Tomcat.
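
To illustrate the difference between relying on the JVM default and naming the charset yourself, here is a minimal sketch in plain Java (no servlet API involved). The class and method names are made up for the example, and what the default charset turns out to be depends on your JVM version, locale and -Dfile.encoding setting, so I am not showing any expected output for it.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class DefaultCharsetDemo {
    // Decodes bytes with an explicitly named charset, independent of the JVM default.
    static String decodeUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String text = "\u00e9t\u00e9"; // "été", written with escapes so the source file encoding does not matter
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);

        // Varies between machines: locale, JVM version and -Dfile.encoding can all influence it.
        System.out.println("JVM default charset: " + Charset.defaultCharset());

        // Implicit decoding: only correct if the default happens to be UTF-8.
        String implicit = new String(utf8);
        System.out.println("implicit decoding matches: " + text.equals(implicit));

        // Explicit decoding: correct for UTF-8 input whatever the default is.
        String explicit = decodeUtf8(utf8);
        System.out.println("explicit decoding matches: " + text.equals(explicit));
    }
}
```

The second line of output can be true or false depending on where you run it, which is exactly the kind of surprise that changing LC_CTYPE can introduce or remove.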

In other words, in my opinion your solution above of setting this explicitly in your servlet is the better one.

Also make sure that all the html pages that you serve contain a tag like
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>

If your html pages contain <form> tags, and you would like the browser to be nice and send you properly UTF-8 encoded form values when posting the form content, then add the following attributes to them, to try and convince the browser to do the right thing: <form .. method="POST" enctype="multipart/form-data" accept-charset="UTF-8">

And then, if you design and edit your html pages yourself, make sure that you use an editor that supports UTF-8, and save your pages as such.

And then, verify at the browser level (for example with Firefox and the LiveHttpHeaders extension) that the browser is actually receiving an HTTP header like
Content-Type: text/html; charset=UTF-8
with every response from your server.

Paranoia: since you can trust neither the user nor the browser anyway, you may still want to add to your <form>s a hidden input field containing a fixed value: a known UTF-8 string with some accented characters. Then, in your application, you can check whether you really received that string as expected. If not, then something unexpected happened with the form encoding, and you should reject the data. Something like:
<input type="hidden" value="ÁlélÜìÄ">
which will have a different byte length depending on whether it is encoded as UTF-8 or ISO-8859-1 (an "é" is 1 byte in ISO-8859-1, but 2 bytes in UTF-8).
That is not really paranoia, it's experience.
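
The server-side half of that check could look roughly like the sketch below. It is plain Java, not tied to any servlet API; the class name, the SENTINEL constant and the isIntact method are my own inventions for the illustration, and in a real application you would pass in whatever your framework returns for the hidden field. The main method simulates a body that was sent as UTF-8 but decoded as ISO-8859-1.

```java
import java.nio.charset.StandardCharsets;

final class FormEncodingCheck {
    // Must be the same string as the hidden field's value in the HTML form: "ÁlélÜìÄ".
    // Written with escapes so the check does not depend on the source file encoding.
    static final String SENTINEL = "\u00c1l\u00e9l\u00dc\u00ec\u00c4";

    // True only if the value arrived exactly as it was sent.
    static boolean isIntact(String received) {
        return SENTINEL.equals(received);
    }

    public static void main(String[] args) {
        byte[] utf8 = SENTINEL.getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = SENTINEL.getBytes(StandardCharsets.ISO_8859_1);
        // 5 accented characters take 2 bytes each in UTF-8, the 2 plain letters take 1 each.
        System.out.println(utf8.length + " bytes as UTF-8, " + latin1.length + " bytes as ISO-8859-1");
        // prints: 12 bytes as UTF-8, 7 bytes as ISO-8859-1

        // Correctly decoded body: the sentinel survives.
        System.out.println(isIntact(new String(utf8, StandardCharsets.UTF_8)));      // true

        // UTF-8 bytes decoded as ISO-8859-1: every accented letter becomes two characters.
        System.out.println(isIntact(new String(utf8, StandardCharsets.ISO_8859_1))); // false
    }
}
```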

That was the practical bit. If you want some more general theorising, keep reading.

In general, for historical reasons mostly, the default charset/encoding for HTML and HTTP is ISO-8859-1 (latin-1). This is not always clear in all RFCs that contribute to various aspects of web applications however, so there is a certain amount of confusion. For example, the RFCs concerning HTML are quite clear (iso-8859-1 by default), while the RFCs concerning HTTP URIs are more vague or mutually contradictory. In any case, it is (unfortunately) not Unicode/UTF-8 everywhere by default, despite the hopes and beliefs of some web developers.

The fact that Java strings are Unicode internally, and that the JVM's default external encoding is often UTF-8 (it actually follows the platform locale, as noted above), tends to comfort some Java/Tomcat developers in the false belief that URLs are also UTF-8 by default. They are not; as far as I can determine, URLs are encoding-neutral.
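
A small sketch of what "encoding-neutral" means in practice: a URL only carries percent-escaped bytes, and nothing in it says which charset those bytes came from. The example uses java.net.URLEncoder and URLDecoder from the standard library; the class name, the helper methods and the sample word are just for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

final class UrlEncodingDemo {
    // Percent-encodes a value using the named charset.
    static String encode(String value, String charsetName) {
        try {
            return URLEncoder.encode(value, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("unknown charset: " + charsetName, e);
        }
    }

    // Decodes a percent-encoded value using the named charset.
    static String decode(String value, String charsetName) {
        try {
            return URLDecoder.decode(value, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("unknown charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        String word = "caf\u00e9"; // "café"

        // The same character produces different escapes depending on the sender's charset.
        System.out.println(encode(word, "UTF-8"));      // caf%C3%A9
        System.out.println(encode(word, "ISO-8859-1")); // caf%E9

        // The receiver has to guess; guessing wrong turns one letter into two.
        String wrong = decode("caf%C3%A9", "ISO-8859-1");
        System.out.println(wrong.length());             // 5, not 4
    }
}
```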

Some people also believe that UTF-8 and ISO-8859-1 are identical anyway for the first 256 Unicode code points, so it doesn't really matter. But this is also incorrect: the code point numbers match, but the encoded bytes are identical only for the first 128 (the ASCII range). And it does matter for anyone trying to build an application that is not purely English-speaking, as you have noticed.
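
If you want to convince yourself of the 128 figure, the following sketch encodes each of the first 256 code points both ways and counts how many come out as the same bytes. Class and method names are hypothetical, chosen only for this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class CharsetOverlap {
    // True if the code point is encoded as the same byte sequence in UTF-8 and ISO-8859-1.
    static boolean sameBytes(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = s.getBytes(StandardCharsets.ISO_8859_1);
        return Arrays.equals(utf8, latin1);
    }

    // Counts how many of the code points 0..255 have identical encodings.
    static int countIdentical() {
        int count = 0;
        for (int cp = 0; cp < 256; cp++) {
            if (sameBytes(cp)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countIdentical());  // 128
        System.out.println(sameBytes('A'));    // true: plain ASCII
        System.out.println(sameBytes(0xE9));   // false: "é" is 1 byte vs 2 bytes
    }
}
```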

And finally, there seems to be some confusion between the parameter that specifies a default encoding for Tomcat's internal processing of URIs and the encoding of the request body or response body. These are separate settings: on the Connector, the URIEncoding attribute controls how URIs are decoded, and I believe the useBodyEncodingForURI attribute is the one that says "use the body encoding for the URL also".

Add to this that users can set up their browsers in various ways, that they may have various keyboards and operating systems, and that some browsers disregard what the server says about documents anyway and think they are smarter, and you get the situation that currently exists on the web, where half the time I cannot enter my first name in a web form and see it returned to me correctly in a response or an email. And I guess you may not be faring much better with your last name.

None of this makes things any simpler, but...

The good news is that it appears to be improving over time, with correct UTF-8 support now in all browsers, and a tendency by web developers to specify UTF-8 explicitly wherever it's needed.
Which is many places, if you really want to stack all the odds in your favour.






