[twsocket] Charset conversion On-The-Fly

Arno Garrels Sat, 27 Mar 2010 13:52:35 -0700

Hi,

You won't believe it: Windows provides neither an API to convert true 
multi-byte character streams to Unicode on the fly, nor an API to determine 
the byte length of such multi-byte character sequences. I'm not talking 
about the double-byte charsets here.


Let's say we receive an ANSI stream encoded with code page 50220 
(iso-2022-jp). Such 7-bit encodings are still frequently used in emails and 
HTML in the Far East. They use ESC-sequences to shift into or out of another 
encoding mode.

Example:
const Sample: array [0..9] of Byte = ($1B, $24, $42, $21, $41, $21, $41, $1B, $28, $42);
The leading ESC-sequence "$1B, $24, $42" tells the decoder to treat the 
following bytes as double-byte characters ("$21, $41, $21, $41"), and the 
trailing ESC-sequence "$1B, $28, $42" shifts back to ASCII mode. Decoded 
correctly, this sample yields two Unicode code points, "～～".

It's easy to imagine what happens if we do not pass the entire sequence 
to MultiByteToWideChar() but split it, for instance, into two chunks. The 
first chunk "$1B, $24, $42, $21, $41" should translate just fine, but the 
second, "$21, $41, $1B, $28, $42", translates to garbage because it no 
longer has a leading ESC-sequence.
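
To make the failure concrete, here is a minimal sketch in Python rather than 
Delphi. It uses Python's built-in "iso2022_jp" codec as a stand-in for 
MultiByteToWideChar() with code page 50220. That substitution is an 
assumption for illustration only: the two converters may not map every 
character to the same code point, so the sketch only compares the whole and 
the chunked result with each other.

```python
# The 10-byte sample from above: ESC $ B, two double-byte characters, ESC ( B.
SAMPLE = bytes([0x1B, 0x24, 0x42, 0x21, 0x41, 0x21, 0x41, 0x1B, 0x28, 0x42])

# Decoding the whole sequence in one call yields two characters.
whole = SAMPLE.decode("iso2022_jp")

# Split after the first double-byte character and decode each chunk
# independently, i.e. without carrying any shift state across the calls.
first, second = SAMPLE[:5], SAMPLE[5:]
stateless = first.decode("iso2022_jp") + second.decode("iso2022_jp")

print(len(whole), repr(whole))
print(len(stateless), repr(stateless))
# The second call starts in ASCII mode again, so the bytes $21, $41 are
# read as the two ASCII characters "!A" instead of one double-byte character.
```

The stateless result is one character longer than the correct one and ends 
in "!A", which is exactly the kind of garbage described above.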

AFAIK there are two possible solutions.

1.) Internet Explorer's (v5+) MLang.dll provides a much better API.
ConvertINetMultiByteToUnicode() takes and returns a "Mode" value that must 
be initialized to zero before the first call. After converting the first 
chunk, "Mode" holds a value <> 0. Passing that value in when converting the 
second chunk yields a correctly translated second chunk. 
ConvertINetMultiByteToUnicode() also returns the number of source bytes it 
translated, which is rather useful too.

2.) The GNU library "iconv.dll", which is licensed under the LGPL.
It's around 800 KB and natively available on Linux and Mac OS. It works 
similarly: iconv uses a context pointer to achieve the same.

Both require passing around either the Mode or the context. So if we want 
to fix the current charset bugs in ICS, some changes are required. It turns 
out that a CodePage parameter alone is not enough to handle charset 
conversion properly in every case.

IMO it's time to move on to another design, some custom TEncoding class most 
likely.
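
As a rough sketch of what such a class could look like, here is the idea in 
Python rather than Delphi. The class name StreamDecoder and its methods 
feed() and close() are hypothetical, not part of ICS or of any existing 
library. Python's incremental decoder stands in for MLang's "Mode" value or 
iconv's context pointer: the object owns the shift state, so callers only 
hand over chunks as they arrive.

```python
import codecs


class StreamDecoder:
    """Hypothetical per-connection decoder that owns the shift state."""

    def __init__(self, charset: str) -> None:
        # The incremental decoder plays the role of MLang's "Mode" value or
        # iconv's context pointer: it remembers the currently active mode.
        self._decoder = codecs.getincrementaldecoder(charset)()

    def feed(self, chunk: bytes) -> str:
        """Decode one received chunk, keeping state for the next one."""
        return self._decoder.decode(chunk)

    def close(self) -> str:
        """Flush at end of stream; raises if a character is left incomplete."""
        return self._decoder.decode(b"", final=True)


# Same sample and same split as in the example above.
sample = bytes([0x1B, 0x24, 0x42, 0x21, 0x41, 0x21, 0x41, 0x1B, 0x28, 0x42])
decoder = StreamDecoder("iso2022_jp")
text = decoder.feed(sample[:5]) + decoder.feed(sample[5:]) + decoder.close()
print(text == sample.decode("iso2022_jp"))
```

One decoder instance per connection would replace the bare CodePage 
parameter, since the state has to live somewhere between two receive events.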

Or maybe you have another idea?

--
Arno Garrels
