Re: [twsocket] Charset conversion On-The-Fly

Arno Garrels Sun, 28 Mar 2010 00:33:42 -0700

Francois PIETTE wrote:

> In which context do you have such a need of on-the-fly conversion ?
> Are you trying to display an email content while it is being
> transferred ?


It's required in TMimeDec, for example: the parser reads from a stream into
a buffer of fixed length, so it is possible that the buffer ends with
_any_ number of bytes that cannot be translated on their own. We need to be
able to detect such incomplete byte sequences, otherwise garbage is decoded.
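
The buffer-boundary problem can be illustrated outside of Delphi with a
minimal Python sketch. It assumes Python's built-in iso2022_jp codec as a
stand-in for code page 50220 (the two are not identical and may map some
code points differently), and the chunk size of 5 is arbitrary. The function
names are made up for this illustration.

```python
import codecs

# Sample from the original post: ESC $ B, two double-byte characters, ESC ( B.
DATA = bytes([0x1B, 0x24, 0x42, 0x21, 0x41, 0x21, 0x41, 0x1B, 0x28, 0x42])

def decode_stateless(data, size):
    """Decode each fixed-size chunk on its own; the shift state is lost."""
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    return "".join(c.decode("iso2022_jp", errors="replace") for c in chunks)

def decode_stateful(data, size):
    """Feed the same chunks to one incremental decoder that keeps the
    shift state and any incomplete trailing bytes between calls."""
    dec = codecs.getincrementaldecoder("iso2022_jp")(errors="strict")
    out = [dec.decode(data[i:i + size]) for i in range(0, len(data), size)]
    out.append(dec.decode(b"", final=True))  # flush at end of input
    return "".join(out)

whole = DATA.decode("iso2022_jp")
print(len(whole))                          # two characters
print(decode_stateful(DATA, 5) == whole)   # chunking no longer matters
print(decode_stateless(DATA, 5) == whole)  # differs: second chunk lost its ESC
```

In the stateless case the second chunk is read in ASCII mode, so its two
payload bytes come out as the literal text "!A".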

We also need a reliable CharNext-function in order to not unintentionally
break a byte sequence. This is required, for instance, in TSmtpCli when 
the component has to fold header lines or wrap message text.
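
A rough idea of what such a CharNext-function has to track, as a Python
sketch. It only knows the four escape sequences of basic ISO-2022-JP
(RFC 1468); char_next and safe_split_points are hypothetical names, not
existing ICS or Windows functions, and other charset variants would need
more sequences.

```python
ESC = 0x1B
# Escape sequences of basic ISO-2022-JP (RFC 1468) and the number of bytes
# per character in the mode each one selects.
MODES = {
    b"\x1b(B": 1,  # ASCII
    b"\x1b(J": 1,  # JIS X 0201 Roman
    b"\x1b$@": 2,  # JIS X 0208-1978
    b"\x1b$B": 2,  # JIS X 0208-1983
}

def char_next(buf, pos, width):
    """Return (next_pos, width) for the unit starting at pos.

    `width` is the bytes-per-character of the current shift mode (1 at the
    start of a stream). An escape sequence counts as one unit and updates
    the width. Returns (pos, width) unchanged if the unit is incomplete.
    """
    if buf[pos] == ESC:
        seq = bytes(buf[pos:pos + 3])
        if seq in MODES:
            return pos + 3, MODES[seq]
        if len(seq) < 3:
            return pos, width          # truncated escape sequence
        raise ValueError("unsupported escape sequence %r" % seq)
    if pos + width > len(buf):
        return pos, width              # truncated multi-byte character
    return pos + width, width

def safe_split_points(buf):
    """List offsets at which buf can be cut without breaking a sequence."""
    points, pos, width = [], 0, 1
    while pos < len(buf):
        nxt, width = char_next(buf, pos, width)
        if nxt == pos:
            break                      # incomplete unit at end of buffer
        pos = nxt
        points.append(pos)
    return points

data = bytes([0x1B, 0x24, 0x42, 0x21, 0x41, 0x21, 0x41, 0x1B, 0x28, 0x42])
print(safe_split_points(data))  # [3, 5, 7, 10]
```

Note that these offsets are only safe at the byte level. When folding a
line, the part after the cut still has to start in the right shift mode.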
 
>> array [0..9] of Byte = ($1B, $24, $42, $21, $41, $21, $41, $1B, $28, $42);
>> The leading ESC-sequence "$1B, $24, $42" tells the decoder to
>> treat the following bytes as double-byte characters ("$21, $41, $21,
>> $41") and the trailing ESC-sequence shifts back to ASCII mode. This
>> sample should translate to two Unicode code points "～～" correctly.
>> 
>> It's easy to imagine what will happen if we do not pass the entire
>> sequence to MultiByteToWideChar() but, for instance, split it up into
>> two chunks. The first chunk "$1B, $24, $42, $21, $41" should translate
>> just fine; the second, "$21, $41, $1B, $28, $42", however, translates
>> to garbage since there is no longer a leading ESC-sequence.
> 
> The leading ESC-sequence "$1B, $24, $42" is always the same ?

No, there are multiple different ESC-sequences per charset, with 
variable length, and some even shift into three-byte character mode;
this was just one example.

Furthermore, MultiByteToWideChar() and WideCharToMultiByte() do
not work correctly with those charsets; they are buggy!
 
The ConvertINet-API works around this by first converting these strings to
one of their corresponding native Windows charsets internally, in my 
sample to DBCS Windows-932.
Have a look here: http://source.winehq.org/source/dlls/mlang/mlang.c 
However, the implementation in WINE is _wrong_ and incomplete;
it handles Japanese only.
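
To make the "convert to the native charset first" idea concrete, here is a
small Python sketch of the well-known JIS X 0208 to Shift-JIS byte
arithmetic, followed by a decode through Python's cp932 codec. This is not
MLang's or WINE's actual code, the function names are hypothetical, and it
assumes Python's cp932 table matches Windows-932 for the code point used.

```python
def jis_to_sjis(j1, j2):
    """Map one JIS X 0208 byte pair (each 0x21..0x7E) to Shift-JIS."""
    if not (0x21 <= j1 <= 0x7E and 0x21 <= j2 <= 0x7E):
        raise ValueError("not a JIS X 0208 byte pair")
    if j1 & 1:                       # odd row
        s2 = j2 + 0x1F
        if s2 >= 0x7F:
            s2 += 1                  # 0x7F is skipped in Shift-JIS
    else:                            # even row
        s2 = j2 + 0x7E
    s1 = ((j1 + 1) >> 1) + 0x70
    if s1 >= 0xA0:
        s1 += 0x40                   # skip the single-byte katakana range
    return bytes([s1, s2])

def decode_jis_run(pairs):
    """Decode a double-byte payload (ESC sequences already stripped)
    by converting it to Shift-JIS and decoding that as cp932."""
    if len(pairs) % 2:
        raise ValueError("odd number of bytes in double-byte run")
    sjis = b"".join(jis_to_sjis(pairs[i], pairs[i + 1])
                    for i in range(0, len(pairs), 2))
    return sjis.decode("cp932")

# Payload of the sample: "$21, $41, $21, $41"
text = decode_jis_run(bytes([0x21, 0x41, 0x21, 0x41]))
print([hex(ord(ch)) for ch in text])  # ['0xff5e', '0xff5e']
```

The pair $21 $41 becomes $81 $60 in Shift-JIS, which the cp932 table maps
to U+FF5E.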

My sample above as two Unicode code points:

UStr := #$FF5E#$FF5E;

Try to convert this string with WideCharToMultiByte() to ANSI code page 
50220; the result is two question marks "??". Both MLang's 
ConvertINetUnicodeToMultiByte() and iconv give the correct result.

> At first glance, it is not difficult to implement a conversion
> routine based on MultiByteToWideChar by prefixing the next chunk with
> the same leading ESC-sequence we would have detected in the previous
> chunk. Implementation could mimic ConvertInetMultibyteToUnicode or be
> encapsulated in a class, for example a stream like class.

Yep, that was my first idea as well and the reason why I looked at the 
WINE source code. I already translated parts of their mlang.c to Delphi,
but, as said above, their implementation is buggy and incomplete.
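
For reference, the prefixing idea could look roughly like the Python sketch
below. PrefixingDecoder is a hypothetical helper, it handles only the four
escape sequences of basic ISO-2022-JP, and it relies on Python's iso2022_jp
codec for the actual conversion (standing in for a stateless call such as
MultiByteToWideChar). It remembers the ESC-sequence that was active at the
end of the previous chunk and holds back any incomplete tail.

```python
ESCAPES = (b"\x1b(B", b"\x1b(J", b"\x1b$@", b"\x1b$B")  # basic ISO-2022-JP only
DOUBLE = (b"\x1b$@", b"\x1b$B")                         # double-byte modes

class PrefixingDecoder:
    """Remembers the active ESC sequence and any incomplete tail,
    and prepends both to the next chunk before a stateless decode."""

    def __init__(self):
        self.active = b"\x1b(B"   # streams start in ASCII mode
        self.tail = b""

    def decode(self, chunk):
        data = self.tail + chunk
        pos, mode = 0, self.active
        while pos < len(data):
            if data[pos] == 0x1B:
                seq = data[pos:pos + 3]
                if len(seq) < 3:
                    break                      # incomplete escape: hold back
                if seq not in ESCAPES:
                    raise ValueError("unsupported escape sequence")
                mode, pos = seq, pos + 3
            else:
                step = 2 if mode in DOUBLE else 1
                if pos + step > len(data):
                    break                      # incomplete character: hold back
                pos += step
        complete, self.tail = data[:pos], data[pos:]
        # Stateless decode, made safe by the remembered ESC prefix.
        text = (self.active + complete).decode("iso2022_jp")
        self.active = mode
        return text

data = bytes([0x1B, 0x24, 0x42, 0x21, 0x41, 0x21, 0x41, 0x1B, 0x28, 0x42])
dec = PrefixingDecoder()
parts = [dec.decode(data[i:i + 4]) for i in range(0, len(data), 4)]
print("".join(parts) == data.decode("iso2022_jp"))  # True
```

Even in this reduced form the decoder has to know every ESC-sequence of the
charset, which is why the approach gets harder with the other variants.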

--
Arno Garrels


> 
> 
> --
> francois.pie...@overbyte.be
> The author of the freeware multi-tier middleware MidWare
> The author of the freeware Internet Component Suite (ICS)
> http://www.overbyte.be
> 
> 
> ----- Original Message -----
> From: "Arno Garrels" <arno.garr...@gmx.de>
> To: "ICS support mailing" <twsocket@elists.org>
> Sent: Saturday, March 27, 2010 10:52 PM
> Subject: [twsocket] Charset conversion On-The-Fly
> 
> 
>> Hi,
>> 
>> You won't believe it, Windows neither provides an API to convert true
>> multi-byte character streams On-The-Fly to Unicode nor an API to
>> determine the number of bytes of such multi-byte character
>> sequences. I do not speak about the double-byte charsets.
>> 
>> Let's say we receive an ansi-stream encoded with code page 50220
>> (iso-2022-jp). Those 7-bit encodings are still frequently used in
>> emails or HTML in the Far East. They use ESC-sequences to shift in or
>> out of another encoding mode.
>> 
>> Example:
>> array [0..9] of Byte = ($1B, $24, $42, $21, $41, $21, $41, $1B, $28, $42);
>> The leading ESC-sequence "$1B, $24, $42" tells the decoder to
>> treat the following bytes as double-byte characters ("$21, $41, $21,
>> $41") and the trailing ESC-sequence shifts back to ASCII mode. This
>> sample should translate to two Unicode code points "～～" correctly.
>> 
>> It's easy to imagine what will happen if we do not pass the entire
>> sequence to MultiByteToWideChar() but, for instance, split it up into
>> two chunks. The first chunk "$1B, $24, $42, $21, $41" should translate
>> just fine; the second, "$21, $41, $1B, $28, $42", however, translates
>> to garbage since there is no longer a leading ESC-sequence.
>> 
>> AFAIK there are two possible solutions.
>> 
>> 1.) Internet Explorer's (v5+) MLang.dll provides a much better API.
>> ConvertInetMultibyteToUnicode() takes and returns a "Mode" value
>> that must be initialized to zero on the first call. After converting
>> the first chunk "Mode" returns a value <> 0. Passing this to convert
>> the second chunk results in a correctly translated second chunk.
>> ConvertInetMultibyteToUnicode() also returns the number of translated
>> source bytes, which is rather useful too.
>> 
>> 2.) GNU library "iconv.dll" which is under LGPL.
>> It's around 800 KB and natively available on Linux and Mac OS. It's
>> similar here: iconv uses a context pointer to achieve the same.
>> 
>> Both require passing around either the Mode or the context. So, if
>> we want to fix current charset-bugs in ICS, some changes are
>> required. It finally turned out that a CodePage parameter alone is
>> not enough to always handle charset conversion properly.
>> 
>> IMO it's time to move on to another design, some custom TEncoding
>> class most likely.
>> 
>> Or maybe you have another idea?
>> 
>> --
>> Arno Garrels
>> 
>> --
>> To unsubscribe or change your settings for TWSocket mailing list
>> please goto http://lists.elists.org/cgi-bin/mailman/listinfo/twsocket
>> Visit our website at http://www.overbyte.be
