RE: problem with national language in html form input

Szegedi, Attila Mon, 19 Mar 2001 06:35:46 -0800
I did the same thing once in my private copy of Tomcat, but abandoned it.
The problem is standards compliance: both the HTML standard and the
Servlet spec largely ignore internationalization on this point.

Tomcat follows the HTML standard, which treats the MIME type
"application/x-www-form-urlencoded" as suitable only for transferring ASCII
(in practice it also works for ISO 8859-1). See
http://www.w3.org/TR/html4/interact/forms.html#h-17.13.4.1
which says:

<citation>
"The content type "application/x-www-form-urlencoded" is inefficient for
sending large quantities of binary data or text containing non-ASCII
characters. The content type "multipart/form-data" should be used for
submitting forms that contain files, non-ASCII data, and binary data."
</citation>

So, if you want to comply with the HTML standard, you should force every
form that may contain non-ASCII characters to be sent as
"multipart/form-data" by setting the "enctype" attribute of the form.
Unfortunately, Tomcat will not parse a "multipart/form-data" body into
request parameters for your servlet.

The HTML standard is further flawed in that it
1. defaults the encoding type of the form to
   "application/x-www-form-urlencoded", and
2. expects browsers to send form data in the same encoding in which they
   received the HTML page (unless the "accept-charset" attribute is set,
   which it usually is not).
So a compliant browser will by default use
"application/x-www-form-urlencoded" and send the data in the encoding in
which it received the HTML page. The trouble is that browsers won't send
the *ENCODING* back to the server in the Content-Type header (at least IE
up to 5.5 and NN up to 4.75 won't). The header will always be
"application/x-www-form-urlencoded" and never
"application/x-www-form-urlencoded; charset=whatever", so Tomcat's
parsePostData can't determine the charset and always assumes ISO 8859-1,
as this is the default.
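
To make the fallback concrete, here is a minimal sketch of how a server
could pick the charset from a Content-Type value and fall back to
ISO 8859-1 when the browser omits it. The class and method names are
hypothetical; this is not Tomcat's actual code.

```java
import java.util.Locale;

// Hypothetical helper, not part of Tomcat or the Servlet API.
final class ContentTypeCharset {
    static final String DEFAULT_CHARSET = "ISO-8859-1";

    // Returns the charset parameter of a Content-Type value,
    // or the ISO 8859-1 default when no charset is given.
    static String charsetOf(String contentType) {
        if (contentType == null) {
            return DEFAULT_CHARSET;
        }
        String[] parts = contentType.split(";");
        // parts[0] is the media type itself; parameters follow it.
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String value = param.substring("charset=".length()).trim();
                // Strip optional double quotes around the value.
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? DEFAULT_CHARSET : value;
            }
        }
        return DEFAULT_CHARSET;
    }
}
```

With the header the browsers above actually send,
charsetOf("application/x-www-form-urlencoded") returns "ISO-8859-1", no
matter which encoding the form data really uses.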

I have some past experience working with Microsoft's ASP technology. They
solved the problem partially by introducing a "session encoding": all
HTML responses used this encoding, and all request parameters were parsed
according to that encoding.
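
The parsing half of that idea can be sketched as follows, assuming the
application already knows which encoding it used for the page. The class
and method names are hypothetical and this is not ASP's or Tomcat's
implementation; it relies only on java.net.URLDecoder.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: decodes an application/x-www-form-urlencoded
// body using an encoding chosen by the application.
final class FormBodyParser {
    static Map<String, List<String>> parse(String body, String encoding) {
        Map<String, List<String>> params = new LinkedHashMap<>();
        if (body == null || body.isEmpty()) {
            return params;
        }
        // Split on the raw separators first, then decode each piece,
        // so an escaped '&' or '=' inside a value is preserved.
        for (String pair : body.split("&")) {
            if (pair.isEmpty()) {
                continue;
            }
            int eq = pair.indexOf('=');
            String rawKey = eq < 0 ? pair : pair.substring(0, eq);
            String rawValue = eq < 0 ? "" : pair.substring(eq + 1);
            try {
                String key = URLDecoder.decode(rawKey, encoding);
                String value = URLDecoder.decode(rawValue, encoding);
                params.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
            } catch (UnsupportedEncodingException e) {
                throw new IllegalArgumentException("Unsupported encoding: " + encoding);
            }
        }
        return params;
    }
}
```

URLDecoder decodes a run of consecutive %xx escapes as one byte sequence,
so this also works for multi-byte encodings such as UTF-8.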

This could be a solution; however, it should go into the servlet spec. (Do
you hear us, servlet spec people?)

My own app uses ISO 8859-2 (as it's in Hungarian), and for now I just
transcode the parameters that Tomcat decoded as 8859-1 into 8859-2.
Luckily, I use the Model 2 paradigm, so I have a single servlet handling
all requests and a single central place to transcode request parameters.
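
A minimal sketch of such a transcoding step, assuming the container has
already decoded the parameter as ISO 8859-1. The class and method names
are hypothetical, and this is not necessarily how the application
described above does it.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for re-decoding a wrongly decoded parameter.
final class ParameterTranscoder {
    private static final Charset LATIN2 = Charset.forName("ISO-8859-2");

    // ISO 8859-1 maps every byte to exactly one character, so encoding
    // the wrongly decoded string back to ISO 8859-1 recovers the bytes
    // the browser sent. Those bytes are then decoded as ISO 8859-2.
    static String latin1ToLatin2(String misdecoded) {
        if (misdecoded == null) {
            return null;
        }
        byte[] raw = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, LATIN2);
    }
}
```

For example, the byte 0xF5 shows up as "õ" after ISO 8859-1 decoding, and
latin1ToLatin2 turns it into the Hungarian "ő" (U+0151). Plain ASCII text
passes through unchanged.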

Cheers,
  Attila.

> -----Original Message-----
> From: Aleksandras Novikovas [mailto:[EMAIL PROTECTED]]
> Sent: Friday, March 16, 2001 10:32 AM
> To: '[EMAIL PROTECTED]'
> Subject: problem with national language in html form input
>
>
> Hello All,
>
> I'm posting for the first time, so please inform me if I do
> something wrong ...
>
> First of all, the problem description:
> I have a multilingual application (where the user can
> dynamically change the charset).
> The problem arises when the user enters information in the selected
> language.
> After parsePostData in HttpUtils I get lots of "????" instead of text.
> I cannot rely on the default system encoding, because the
> application can add languages dynamically
> without recompilation.
> So I never know which encoding the system will need next.
>
> I have written some code to work around this problem and
> think it would be nice to have it in the standard package.
> Specifically, I changed parsePostData by adding an encoding parameter.
> Now the programmer can choose the encoding in which the InputStream
> is interpreted.
> I have tested it with windows-1257 (Baltic) and windows-1251
> (Cyrillic), and it worked for me.
> If anyone finds any errors, please let me know.
> Here is the code of that method:
>
> //////////////////////////////////////////////////////////////
> // Parses data from an HTML form that the client sends to
> // the server using the HTTP POST method and the
> // <i>application/x-www-form-urlencoded</i> MIME type.
> //
> // <p>The data sent by the POST method contains key-value
> // pairs. A key can appear more than once in the POST data
> // with different values. However, the key appears only once in
> // the hashtable, with its value being
> // an array of strings containing the multiple values sent
> // by the POST method.
> //
> // <p>The keys and values in the hashtable are stored in their
> // decoded form, so
> // any + characters are converted to spaces, and characters
> // sent in hexadecimal notation (like <i>%xx</i>) are
> // decoded using the specified encoding.
> //
> // @param len an integer specifying the length,
> //                            in bytes, of the
> //                            <code>ServletInputStream</code>
> //                            object that is also passed to this
> //                            method
> // @param in  the <code>ServletInputStream</code>
> //                            object that contains the data sent
> //                            from the client
> // @param enc a String specifying the character encoding
> //                            of the <code>ServletInputStream</code>
> //                            object
> //
> // @return            a <code>Hashtable</code> object built
> //                            from the parsed key-value pairs
> //
> // @exception IllegalArgumentException        if the data
> //                            sent by the POST method is invalid
> //////////////////////////////////////////////////////////////
>
> public Hashtable parsePostData (int len, ServletInputStream in, String enc)
> {
>     // XXX
>     // should a length of 0 be an IllegalArgumentException
>
>     if (len <= 0)
>         return new Hashtable (); // cheap hack to return an empty hash
>
>     if (in == null) {
>         throw new IllegalArgumentException ();
>     }
>
>     // Make sure we read the entire POSTed body.
>     byte [] postedBytes = new byte [len];
>     try {
>         int offset = 0;
>         do {
>             int inputLen = in.read (postedBytes, offset, len - offset);
>             if (inputLen <= 0) {
>                 throw new IllegalArgumentException (
>                     lStrings.getString ("err.io.short_read"));
>             }
>             offset += inputLen;
>         } while ((len - offset) > 0);
>     }
>     catch (IOException e) {
>         throw new IllegalArgumentException (e.getMessage ());
>     }
>
>     // Here some changes ...
>     // Direct parsing of postedBytes, converting to
>     // desired unicode symbol and forming final string.
>     // Note: each %xx byte is decoded on its own, so this only
>     // works for single-byte encodings.
>
>     StringBuffer sb = new StringBuffer ();
>     Integer unicodeInteger;
>     // Walk every byte, including the last one
>     for (int i = 0; i < postedBytes.length; i++) {
>         String testString = new String (postedBytes, i, 1);
>         switch (testString.charAt (0)) {
>             case '+' :
>                 sb.append (' ');
>                 break;
>             case '%' :
>                 try {
>                     // Here is actual conversion to unicode
>                     unicodeInteger = Integer.valueOf (
>                         new String (postedBytes, i + 1, 2), 16);
>                     sb.append (new String (
>                         new byte [] {unicodeInteger.byteValue ()}, enc));
>                     i += 2;
>                 }
>                 catch (NumberFormatException e) {
>                     throw new IllegalArgumentException ();
>                 }
>                 catch (UnsupportedEncodingException e) {
>                     throw new IllegalArgumentException ();
>                 }
>                 catch (IndexOutOfBoundsException e) {
>                     // This can happen only at the end of stream
>                     // So just add the rest and stop loop
>                     String rest = new String (postedBytes, i,
>                         postedBytes.length - i);
>                     sb.append (rest);
>                     i += rest.length ();
>                 }
>                 break;
>             default:
>                 // Here do not use encoding
>                 // It is expected, that request is sent in
>                 sb.append (new String (postedBytes, i, 1));
>                 break;
>         }
>     }
>     // parseQueryString decodes '+' and %xx again, so escaped
>     // '&', '=', '+' and '%' do not survive this call.
>     return (parseQueryString (sb.toString ()));
> }
>
>
> Best regards,
> Aleksandras Novikovas [EMAIL PROTECTED]
> IT manager
> Baltic Logistic System Vilnius Ltd.
> Kirtumu 51, Vilnius, Lithuania
> Phone: +370-2-390874; FAX: +370-2-390899; Mobile: +370-99-21678
>
>
>
>