Hi,

You won't believe it: Windows provides neither an API to convert true multi-byte character streams to Unicode on the fly, nor an API to determine the number of bytes in such multi-byte character sequences. I am not talking about the double-byte charsets here.
Let's say we receive an ANSI stream encoded with code page 50220 (iso-2022-jp). Such 7-bit encodings are still frequently used in emails and HTML in the Far East. They use ESC sequences to shift into or out of another encoding mode. Example:

  array [0..9] of Byte = ($1B, $24, $42, $21, $41, $21, $41, $1B, $28, $42);

The leading ESC sequence "$1B, $24, $42" tells the decoder to treat the following bytes as double-byte characters ("$21, $41, $21, $41"), and the trailing ESC sequence "$1B, $28, $42" shifts back to ASCII mode. Decoded correctly, this sample translates to two Unicode code points "~~".

It's easy to imagine what happens if we do not pass the entire sequence to MultiByteToWideChar() but, for instance, split it into two chunks. The first chunk "$1B, $24, $42, $21, $41" translates just fine, but the second chunk "$21, $41, $1B, $28, $42" translates to garbage, because it no longer has a leading ESC sequence and MultiByteToWideChar() keeps no state between calls.

AFAIK there are two possible solutions.

1.) Internet Explorer's (v5+) MLang.dll provides a much better API. ConvertINetMultiByteToUnicode() takes and returns a "Mode" value that must be initialized to zero on the first call. After converting the first chunk, "Mode" holds a value <> 0. Passing that value in when converting the second chunk results in a correctly translated second chunk. ConvertINetMultiByteToUnicode() also returns the number of translated source bytes, which is rather useful too.

2.) The GNU library "iconv.dll", which is under the LGPL. It's around 800 KB and natively available on Linux and Mac OS. It works similarly: iconv uses a context pointer to achieve the same thing.

Both require passing around either the Mode or the context. So if we want to fix the current charset bugs in ICS, some changes are required. It has turned out that a CodePage parameter alone is not enough to handle charset conversion properly in all cases. IMO it's time to move on to another design, most likely some custom TEncoding class. Or maybe you have another idea?
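To make the chunking problem concrete, here is a minimal sketch. It is in Python rather than Delphi, and it uses neither MLang nor iconv; it only assumes that Python's built-in "iso2022_jp" codec is a fair stand-in for a stateful decoder. The helper names decode_stateless and decode_stateful are made up for this illustration. The incremental decoder object plays the role of the "Mode" value or the iconv context: it carries the shift state from one chunk to the next.

```python
import codecs

# The sample from this post: ESC $ B, two double-byte characters, ESC ( B.
DATA = bytes([0x1B, 0x24, 0x42, 0x21, 0x41, 0x21, 0x41, 0x1B, 0x28, 0x42])
CHUNK1, CHUNK2 = DATA[:5], DATA[5:]


def decode_stateless(chunks):
    """Decode each chunk on its own; the shift state is lost between calls."""
    return "".join(chunk.decode("iso2022_jp") for chunk in chunks)


def decode_stateful(chunks):
    """Decode all chunks with one incremental decoder that keeps the shift state."""
    decoder = codecs.getincrementaldecoder("iso2022_jp")()
    parts = [decoder.decode(chunk) for chunk in chunks]
    parts.append(decoder.decode(b"", final=True))  # signal end of stream
    return "".join(parts)


whole = DATA.decode("iso2022_jp")            # reference: decoded in one piece
broken = decode_stateless([CHUNK1, CHUNK2])  # second chunk is read as ASCII
fixed = decode_stateful([CHUNK1, CHUNK2])    # matches the one-piece result

print(len(whole), fixed == whole, broken == whole)
```

In the stateless case the second chunk starts in ASCII mode, so its bytes $21, $41 come out as the two ASCII characters "!A" instead of one double-byte character. A custom encoding class for ICS would presumably have to own such a state in the same way, whatever the underlying library is.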
--
Arno Garrels