Francois PIETTE wrote:
> In which context do you have such a need of on-the-fly conversion ?
> Are you trying to display an email content while it is being
> transferred ?

It's required in TMimeDec, for example: the parser reads from a stream
into a buffer of fixed length, so the end of the buffer may hold _any_
number of non-translatable bytes. We need to be able to detect such
invalid bytes, otherwise garbage is decoded. We also need a reliable
CharNext function so that we do not unintentionally break a byte
sequence. This is required, for instance, in TSmtpCli when the component
has to fold header lines or wrap message text.

>> array [0..9] of Byte = ($1B, $24, $42, $21, $41, $21, $41, $1B, $28,
>> $42); The leading ESC-sequence "$1B, $24, $42" tells the decoder to
>> treat the following bytes as double-byte characters ("$21, $41, $21,
>> $41") and the trailing ESC-sequence shifts back to ASCII mode. This
>> sample should translate to two Unicode code points "~~" correctly.
>>
>> It's easy to imagine what will happen if we do not pass the entire
>> sequence to MultiByteToWideChar() but, for instance, split up in two
>> chunks. Where the first chunk "$1B, $24, $42, $21, $41" should translate
>> just fine, however the second "$21, $41, $1B, $28, $42" translates
>> to garbage since there is no longer a leading ESC-sequence.
>
> The leading ESC-sequence "$1B, $24, $42" is always the same ?

No, there are multiple different ESC-sequences per charset, with variable
length, and some even shift in to three-byte character mode; this was
just one example. Furthermore, MultiByteToWideChar() and
WideCharToMultiByte() do not work correctly with those charsets, they are
buggy! The ConvertINet API works around this by first converting these
strings internally to one of their corresponding native Windows charsets,
in my sample to DBCS Windows-932. Have a look here:
http://source.winehq.org/source/dlls/mlang/mlang.c
However, the implementation in WINE is _wrong_ and incomplete; it handles
Japanese only.

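To make the CharNext requirement above more concrete, here is a minimal
sketch in Python (not ICS code). It only knows the four escape sequences
that RFC 1468 lists for ISO-2022-JP, so it is an illustration under that
assumption and not a general solution for the other charsets mentioned.
The names ESCAPES, SAMPLE and safe_split are made up for this example.

```python
# Escape sequences of ISO-2022-JP as listed in RFC 1468, mapped to the
# number of bytes per character in the mode they switch to.
# Assumption: other charsets or variants need a larger table.
ESCAPES = {
    b"\x1b(B": 1,  # ASCII
    b"\x1b(J": 1,  # JIS X 0201 Roman
    b"\x1b$@": 2,  # JIS X 0208-1978
    b"\x1b$B": 2,  # JIS X 0208-1983
}

# The sample from the thread: ESC $ B, two double-byte characters, ESC ( B.
SAMPLE = bytes([0x1B, 0x24, 0x42, 0x21, 0x41, 0x21, 0x41, 0x1B, 0x28, 0x42])


def safe_split(data: bytes, limit: int, width: int = 1):
    """Return (offset, width) for the largest offset <= limit that cuts
    neither an escape sequence nor a multi-byte character.

    `width` is the character width in effect at the start of `data`; the
    returned width is the one in effect at the returned offset.
    """
    end = min(limit, len(data))
    pos = 0
    while pos < end:
        if data[pos] == 0x1B:
            seq = data[pos:pos + 3]
            if len(seq) == 3 and seq not in ESCAPES:
                raise ValueError(f"unsupported escape sequence at offset {pos}")
            step, next_width = 3, ESCAPES.get(seq, width)
        else:
            step, next_width = width, width
        if pos + step > end:
            break  # the next unit would be cut; stop in front of it
        pos += step
        width = next_width
    return pos, width


# A limit of 4 would cut the first double-byte character in half, so the
# scan stops right after the leading escape sequence instead.
print(safe_split(SAMPLE, 4))
print(safe_split(SAMPLE, 5))
```

The returned width tells the caller whether the split point lies inside
double-byte mode. A component that folds header lines would presumably
also have to shift back to ASCII before the line break and repeat the
escape sequence afterwards; that part is not shown here.
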
My sample above as two Unicode code points:
UStr := #$FF5E#$FF5E;
Try to convert this string with WideCharToMultiByte() to ANSI code page
50220; the result is two question marks "??". Both MLang's
ConvertINetUnicodeToMultiByte() and iconv give the correct result.

> At first glance, it is not difficult to implement a conversion
> routine based on MultiByteToWideChar by prefixing the next chunk with
> the same leading ESC-sequence we would have detected in the previous
> chunk. Implementation could mimic ConvertINetMultiByteToUnicode or be
> encapsulated in a class, for example a stream like class.

Yep, that was my first idea as well and the reason why I looked at the
WINE source code. I already translated parts of their mlang.c to Delphi,
but as said above their implementation is buggy and incomplete.

--
Arno Garrels

>
>
> --
> francois.pie...@overbyte.be
> The author of the freeware multi-tier middleware MidWare
> The author of the freeware Internet Component Suite (ICS)
> http://www.overbyte.be
>
>
> ----- Original Message -----
> From: "Arno Garrels" <arno.garr...@gmx.de>
> To: "ICS support mailing" <twsocket@elists.org>
> Sent: Saturday, March 27, 2010 10:52 PM
> Subject: [twsocket] Charset conversion On-The-Fly
>
>
>> Hi,
>>
>> You won't believe it, Windows neither provides an API to convert true
>> multi-byte character streams On-The-Fly to Unicode nor an API to
>> determine the number of bytes of such multi-byte character
>> sequences. I do not speak about the double-byte charsets.
>>
>> Let's say we receive an ansi-stream encoded with code page 50220
>> (iso-2022-jp). Those 7-bit encodings are still frequently used in
>> emails or HTML in Far East. They use ESC-sequences to shift in or out
>> another encoding mode.
>>
>> Example:
>> array [0..9] of Byte = ($1B, $24, $42, $21, $41, $21, $41, $1B, $28,
>> $42); The leading ESC-sequence "$1B, $24, $42" tells the decoder to
>> treat the following bytes as double-byte characters ("$21, $41, $21,
>> $41") and the trailing ESC-sequence shifts back to ASCII mode. This
>> sample should translate to two Unicode code points "~~" correctly.
>>
>> It's easy to imagine what will happen if we do not pass the entire
>> sequence to MultiByteToWideChar() but, for instance, split up in two
>> chunks. Where the first chunk "$1B, $24, $42, $21, $41" should translate
>> just fine, however the second "$21, $41, $1B, $28, $42" translates
>> to garbage since there is no longer a leading ESC-sequence.
>>
>> AFAIK there are two possible solutions.
>>
>> 1.) Internet Explorer's (v5+) MLang.dll provides a much better API.
>> ConvertINetMultiByteToUnicode() takes and returns a "Mode" value
>> that must be initialized to zero on the first call. After converting
>> the first chunk "Mode" returns a value <> 0. Passing this to convert
>> the second chunk results in a correctly translated second chunk.
>> ConvertINetMultiByteToUnicode() also returns the number of translated
>> source bytes which is rather useful too.
>>
>> 2.) GNU library "iconv.dll" which is under LGPL.
>> It's around 800 KB and natively available in Linux and Mac OS. It's
>> similar here, iconv uses some context-pointer to achieve the same.
>>
>> Both require passing around either the Mode or the context. So, if
>> we want to fix current charset-bugs in ICS, some changes are
>> required. It finally turned out that simply a CodePage parameter is
>> not enough to always handle charset-works properly.
>>
>> IMO it's time to move on to another design, some custom TEncoding
>> class most likely.
>>
>> Or maybe you have another idea?
>>
>> --
>> Arno Garrels
>>
>> --
>> To unsubscribe or change your settings for TWSocket mailing list
>> please goto http://lists.elists.org/cgi-bin/mailman/listinfo/twsocket
>> Visit our website at http://www.overbyte.be
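
The "Mode" value and the iconv context described in the original message
both amount to a decoder that remembers its shift state between calls.
As a rough illustration only, the following Python sketch uses the
standard library's incremental decoder in place of MLang or iconv and
runs the sample bytes from the thread through it, split into the same two
chunks. The helper names decode_stateless and decode_stateful are
invented for this example. Python's codec may map the byte pair to a
different code point than Windows does, so the sketch compares results
with each other instead of checking specific code points.

```python
import codecs

# The sample from the thread: ESC $ B, two double-byte characters, ESC ( B.
DATA = bytes([0x1B, 0x24, 0x42, 0x21, 0x41, 0x21, 0x41, 0x1B, 0x28, 0x42])


def decode_stateless(chunks):
    """Decode every chunk on its own, the way a call without any
    Mode/context parameter has to."""
    return "".join(c.decode("iso2022_jp", errors="replace") for c in chunks)


def decode_stateful(chunks):
    """Decode all chunks with one decoder object that keeps the shift
    state (and any incomplete trailing bytes) between calls."""
    decoder = codecs.getincrementaldecoder("iso2022_jp")(errors="replace")
    parts = [decoder.decode(c) for c in chunks]
    parts.append(decoder.decode(b"", final=True))  # flush at end of input
    return "".join(parts)


whole = DATA.decode("iso2022_jp")   # reference: the entire sequence at once
chunks = [DATA[:5], DATA[5:]]       # the split described in the thread

print(len(whole))
print(decode_stateful(chunks) == whole)
print(decode_stateless(chunks) == whole)
```

In the stateless variant the second chunk no longer starts with an escape
sequence, so its first two bytes are read as plain ASCII. A custom
TEncoding-like class would have to carry the equivalent of this decoder
object from one buffer to the next.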