Re: UTF-8 indexing and searching

Paul Libbrecht Fri, 01 Jul 2005 13:58:10 -0700

Be careful: in the HTTP world there is an ambiguity. application/x-www-form-urlencoded does not specify the character encoding of the bytes represented by the %-escaped sequences.

That's fixed by the very recent URI spec where absence means utf-8...
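
To make the ambiguity concrete, here is a minimal sketch (the class and method names are made up for illustration). It decodes the same %-escaped sequence twice with java.net.URLDecoder, once as UTF-8 and once as ISO-8859-1, and the results differ:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class FormDecodingAmbiguity {
    // Hypothetical helper: decode a %-escaped value with an explicit charset.
    static String decode(String escaped, String charsetName) {
        try {
            return URLDecoder.decode(escaped, charsetName);
        } catch (UnsupportedEncodingException e) {
            // Both charsets used below are required on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String escaped = "caf%C3%A9"; // the bytes C3 A9 are "é" in UTF-8

        String asUtf8 = decode(escaped, "UTF-8");        // 4 characters
        String asLatin1 = decode(escaped, "ISO-8859-1"); // 5 characters

        System.out.println(asUtf8.length() + " vs " + asLatin1.length());
    }
}
```

Nothing in the escaped string itself says which of the two readings the sender intended.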

My experience was that Tomcat simply converted each of these bytes into the low byte of a 16-bit Unicode character, which amounts to decoding as ISO-8859-1. We succeeded in receiving forms from UTF-8-encoded pages by wrapping an InputStreamReader in UTF-8 around an InputStream that reads the bytes of the string returned by request.getParam...
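
Roughly, the workaround looks like the sketch below. This is a reconstruction under assumptions, not our original code: the class and method names are hypothetical, and the input is taken to be a parameter value that the container already decoded as ISO-8859-1.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class Utf8ParamFix {
    // Hypothetical helper: recover the UTF-8 text from a string whose
    // characters are really raw bytes (one byte per char, ISO-8859-1 style).
    static String redecodeAsUtf8(String misdecoded) {
        // Each char maps back to the single byte it was built from.
        byte[] raw = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        InputStream in = new ByteArrayInputStream(raw);
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            StringBuilder out = new StringBuilder();
            char[] buf = new char[256];
            int n;
            while ((n = reader.read(buf)) != -1) {
                out.append(buf, 0, n);
            }
            return out.toString();
        } catch (IOException e) {
            // Not expected for an in-memory stream.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // "é" sent as UTF-8 (C3 A9) but read as two ISO-8859-1 characters.
        String fromContainer = "caf\u00C3\u00A9";
        System.out.println(redecodeAsUtf8(fromContainer));
    }
}
```

This only works if the value really was decoded byte-for-byte as ISO-8859-1. Applied to a string that was already decoded correctly, it would corrupt it.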


Hope that helps.

paul



On 1 Jul 2005, at 22:41, <[EMAIL PROTECTED]> wrote:


Did you check that the request string you get at the analyzer
level is correctly encoded as UTF-8?
We had the same problem with French accented characters encoded
as UTF-8 and transmitted by Tomcat as ISO-8859-1. It was
just a test, so we didn't investigate much, but we
re-encoded the string for a URL as ISO-8859-1 and re-decoded it
from the URL form as UTF-8, and it worked.
I don't know if this will help you ...
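
A minimal sketch of that re-encode/re-decode round trip, assuming it was done with java.net.URLEncoder and java.net.URLDecoder (the message does not say which API was used, and the class and method names here are hypothetical):

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

public class UrlRoundTripFix {
    // Hypothetical helper: undo an ISO-8859-1 mis-decoding by going back to
    // the %-escaped form and decoding it again as UTF-8.
    static String fixViaUrlRoundTrip(String misdecoded) {
        try {
            // Step 1: re-escape using the charset the container wrongly assumed,
            // which restores the original %-escaped bytes.
            String escaped = URLEncoder.encode(misdecoded, "ISO-8859-1");
            // Step 2: decode those bytes with the charset the page really used.
            return URLDecoder.decode(escaped, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Both charsets are required on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String fromTomcat = "caf\u00C3\u00A9"; // UTF-8 "é" read as ISO-8859-1
        System.out.println(fixViaUrlRoundTrip(fromTomcat));
    }
}
```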


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



