Hi,

A long mail, kind of important if you are interested in encodings. I'll get most of it checked in as soon as I finish running the tests and removing the debug statements (probably this weekend - I have a week of vacation after that, with little computer time).

As you know, charsets and bytes are one of my biggest obsessions :-) I've been trying to fix this for a long time, much longer than I ever expected - but I think I'm very close. I am able to get POST and GET correctly decoded in the right charset. I cut and pasted some Japanese characters and the result looked very close to the original, same for 8859-2 (except that I can only look at the char codes - my Linux still doesn't display them right).

I also did a lot of cleanup and optimization in the byte->char conversion, similar to what we have on char->byte. This will be very important for getReader() and POST performance. In order to implement that, I had to extend the Byte/CharChunks with code that allows them to own and resize a buffer (the code that used to be in OutputBuffer and was moved to ByteBuffer). This is also great because it makes the Chunks much more flexible and powerful (i.e. less memcopy).

On a different front, I've added a new hook (postReadRequest), identical to the hook with the same name in Apache. It'll be invoked just after the request is received and before contextMap, and will be used to do all the decoding and to save the undecoded request for the facade (the servlet API requires us to return it undecoded, while internally we need to decode it for mapping). A new module implementing this hook will be added and used to detect the charset and do the conversion when possible, and this will allow people to plug in their own charset decoding mechanisms. Extracting the session id from the request will also move to this stage, eliminating one more potential configuration problem.

The biggest problem remains, of course, detecting the right charset - none of the browsers (except lynx) send the charset.
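To illustrate why GET/POST decoding has to happen at the byte level: %-escaped parameters must be turned back into bytes first, and only then converted to chars in the page's charset - going char by char mangles multibyte encodings. A minimal sketch (the QueryDecoder class is made up for this mail, not the code I'm checking in):

```java
import java.io.ByteArrayOutputStream;

// Illustrative only: decode a %-escaped query value to raw bytes, then
// convert the whole byte run in the right charset. Assumes the escaped
// form itself is plain ASCII, as it is on the wire.
public class QueryDecoder {
    public static String decode(String s, String charset) throws Exception {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '%') {
                // %XX -> one raw byte
                out.write(Integer.parseInt(s.substring(i + 1, i + 3), 16));
                i += 2;
            } else if (c == '+') {
                out.write(' ');
            } else {
                out.write((byte) c);
            }
        }
        // One byte->char conversion over the full buffer, in the given charset
        return out.toString(charset);
    }

    public static void main(String[] args) throws Exception {
        // %E3%81%82 is U+3042 (Hiragana A) in UTF-8
        System.out.println(QueryDecoder.decode("%E3%81%82", "UTF-8"));
        // The same three bytes read as 8859-1 come out as three garbage chars
        System.out.println(QueryDecoder.decode("%E3%81%82", "ISO-8859-1").length()); // 3
    }
}
```

The same bytes give one char or three depending on the charset you pick - which is exactly why guessing wrong is so visible.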
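The chunk change is roughly this idea (GrowableByteChunk and makeSpace are just illustration names for this mail - the real code lives in the Byte/CharChunk classes):

```java
import java.util.Arrays;

// Sketch of a chunk that owns and can resize its buffer, so callers can
// append without an external buffer-management layer (i.e. less memcopy).
public class GrowableByteChunk {
    private byte[] buf;
    private int end; // number of valid bytes

    public GrowableByteChunk(int initial) {
        buf = new byte[initial];
    }

    // Grow geometrically so repeated appends stay cheap on average.
    private void makeSpace(int count) {
        if (end + count <= buf.length) return;
        int newSize = Math.max(buf.length * 2, end + count);
        buf = Arrays.copyOf(buf, newSize);
    }

    public void append(byte[] src, int off, int len) {
        makeSpace(len);
        System.arraycopy(src, off, buf, end, len);
        end += len;
    }

    // Convert the accumulated bytes in one shot, in the given charset.
    public String toString(String charset) throws Exception {
        return new String(buf, 0, end, charset);
    }

    public static void main(String[] args) throws Exception {
        GrowableByteChunk c = new GrowableByteChunk(4);
        byte[] data = "hello world".getBytes("ISO-8859-1");
        c.append(data, 0, data.length);
        System.out.println(c.toString("ISO-8859-1")); // prints "hello world"
    }
}
```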
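The detection module would pick the charset along these lines (again, class and method names are made up for the example): explicit charset= in the Content-Type first, then a per-application default, then the spec-mandated 8859-1.

```java
// Sketch of a pluggable charset resolution order - not the actual module.
public class CharsetResolver {
    public static String resolve(String contentType, String appDefault) {
        if (contentType != null) {
            int idx = contentType.toLowerCase().indexOf("charset=");
            if (idx >= 0) {
                String cs = contentType.substring(idx + 8).trim();
                int semi = cs.indexOf(';');
                if (semi >= 0) cs = cs.substring(0, semi).trim();
                if (cs.length() > 0) return cs; // browser told us (rare)
            }
        }
        if (appDefault != null) return appDefault; // per-app configuration
        return "ISO-8859-1"; // servlet spec default, can't change that
    }

    public static void main(String[] args) {
        System.out.println(resolve(
            "application/x-www-form-urlencoded; charset=Shift_JIS", null)); // Shift_JIS
        System.out.println(resolve(
            "application/x-www-form-urlencoded", "UTF-8")); // UTF-8
        System.out.println(resolve(null, null)); // ISO-8859-1
    }
}
```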
The good news is that all browsers send POST/GET data in the same encoding as the page that contains the form, and most browsers support UTF-8 - that means you can get something consistent working. Unfortunately, the servlet spec requires us to use 8859-1 as the default, and we can't change that.

All I can do about this problem is make sure it'll be very easy for people to plug in their own module that detects the encoding, and eventually to configure the default per application. (Note that servlet 2.3 doesn't really solve this problem either - the user can set the charset, assuming he knows it - but if it were easy to guess the charset we wouldn't have this problem in the first place.)

There are many tricks that could be used to solve the problem, including a per-session charset, etc. - and I'll try to get as many as possible in, but you'll have to enable the code explicitly, as the default will be to use the charset= parameter (which we know doesn't work in any case). If you have ideas - please let me know.

With a bit of luck, 3.3 might start to have some very good support for non-latin1 apps.

Costin