Hello Arno,
In which context do you need on-the-fly conversion? Are you trying to
display the content of an email while it is being transferred?
array [0..9] of Byte = ($1B, $24, $42, $21, $41, $21, $41, $1B, $28, $42);
The leading ESC-sequence "$1B, $24, $42" tells the decoder to treat the
following bytes as double-byte characters ("$21, $41, $21, $41") and the
trailing ESC-sequence shifts back to ASCII mode. This sample should
translate to two Unicode code points "~~" correctly.
It's easy to imagine what will happen if we do not pass the entire sequence
to MultiByteToWideChar() but, for instance, split it into two chunks. The
first chunk "$1B, $24, $42, $21, $41" should translate just fine, but the
second chunk "$21, $41, $1B, $28, $42" translates to garbage since it no
longer has a leading ESC-sequence.
Is the leading ESC-sequence "$1B, $24, $42" always the same? Or are there
different leading ESC-sequences?
At first glance, it is not difficult to implement a conversion routine based
on MultiByteToWideChar by prefixing the next chunk with the leading
ESC-sequence detected in the previous chunk. The implementation could
mimic ConvertInetMultibyteToUnicode or be encapsulated in a class, for
example a stream-like class.
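
To make the prefixing idea concrete, here is a minimal sketch. It is written in Python only because its standard library ships an ISO-2022-JP codec; the stateless bytes.decode() call stands in for MultiByteToWideChar and its Unicode mapping may differ from code page 50220. The function decode_chunks and its helper names are hypothetical, and it assumes only the three-byte shift sequences of RFC 1468. It also holds back a cut-off ESC-sequence or half a double-byte character until the next chunk arrives.

```python
# Sketch only: Python's stateless bytes.decode() stands in for
# MultiByteToWideChar(); decode_chunks and its helper names are hypothetical.
ESC = 0x1B
# Three-byte shift sequences of ISO-2022-JP (RFC 1468) that select a
# double-byte character set; every other mode is treated as single-byte.
DOUBLE_BYTE = {b"\x1b$@", b"\x1b$B"}


def decode_chunks(chunks, codec="iso2022_jp"):
    """Decode chunks independently, re-sending the active shift sequence."""
    out = []
    active = b""  # shift sequence in effect at the start of the next chunk
    carry = b""   # cut-off escape sequence or half character held back
    for chunk in chunks:
        buf = carry + chunk
        mode = active
        i = 0
        while i < len(buf):
            if buf[i] == ESC:
                if i + 3 > len(buf):
                    break              # escape sequence is cut off
                mode = buf[i:i + 3]
                i += 3
            elif mode in DOUBLE_BYTE:
                if i + 2 > len(buf):
                    break              # second byte not received yet
                i += 2
            else:
                i += 1
        complete, carry = buf[:i], buf[i:]
        # Prefix the chunk with the sequence detected in earlier chunks.
        out.append((active + complete).decode(codec))
        active = mode
    return "".join(out)  # a truncated tail left in carry is ignored here


data = bytes([0x1B, 0x24, 0x42, 0x21, 0x41, 0x21, 0x41, 0x1B, 0x28, 0x42])
print(decode_chunks([data[:5], data[5:]]) == data.decode("iso2022_jp"))
```

The scan has to know which modes are double-byte, so a real routine would need the full list of shift sequences of each supported code page.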
--
francois.pie...@overbyte.be
The author of the freeware multi-tier middleware MidWare
The author of the freeware Internet Component Suite (ICS)
http://www.overbyte.be
----- Original Message -----
From: "Arno Garrels" <arno.garr...@gmx.de>
To: "ICS support mailing" <twsocket@elists.org>
Sent: Saturday, March 27, 2010 10:52 PM
Subject: [twsocket] Charset conversion On-The-Fly
Hi,
You won't believe it: Windows provides neither an API to convert true
multi-byte character streams on the fly to Unicode nor an API to determine
the number of bytes in such multi-byte character sequences. I am not
talking about the double-byte charsets.
Let's say we receive an ANSI stream encoded with code page 50220
(iso-2022-jp). Such 7-bit encodings are still frequently used in emails or
HTML in the Far East. They use ESC-sequences to shift into or out of another
encoding mode.
Example:
array [0..9] of Byte = ($1B, $24, $42, $21, $41, $21, $41, $1B, $28, $42);
The leading ESC-sequence "$1B, $24, $42" tells the decoder to treat the
following bytes as double-byte characters ("$21, $41, $21, $41") and the
trailing ESC-sequence shifts back to ASCII mode. This sample should
translate to two Unicode code points "~~" correctly.
It's easy to imagine what will happen if we do not pass the entire sequence
to MultiByteToWideChar() but, for instance, split it into two chunks. The
first chunk "$1B, $24, $42, $21, $41" should translate just fine, but the
second chunk "$21, $41, $1B, $28, $42" translates to garbage since it no
longer has a leading ESC-sequence.
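
The effect can be reproduced outside Windows. The sketch below is an illustration only: it uses Python's built-in iso2022_jp codec as a stand-in for MultiByteToWideChar with code page 50220, so the exact Unicode code points may differ from what Windows returns. Each decode() call is stateless, like a single MultiByteToWideChar call.

```python
# Illustration only: Python's built-in iso2022_jp codec stands in for
# MultiByteToWideChar(50220, ...); the exact Unicode mapping may differ.
data = bytes([0x1B, 0x24, 0x42, 0x21, 0x41, 0x21, 0x41, 0x1B, 0x28, 0x42])

# Entire sequence in one call: two double-byte characters.
whole = data.decode("iso2022_jp")

# Split after the first double-byte character and decode each chunk on its
# own, i.e. without carrying any shift state from one call to the next.
chunk1, chunk2 = data[:5], data[5:]
split = chunk1.decode("iso2022_jp") + chunk2.decode("iso2022_jp")

print(len(whole), whole == split)
# The second chunk starts in ASCII mode, so $21 $41 are read as two
# single-byte characters instead of one double-byte character.
print(repr(chunk2.decode("iso2022_jp")))
```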
AFAIK there are two possible solutions.
1.) Internet Explorer's (v5+) MLang.dll provides a much better API.
ConvertInetMultibyteToUnicode() takes and returns a "Mode" value that must
be initialized to zero on the first call. After converting the first chunk
"Mode" returns a value <> 0. Passing this to convert the second chunk
results in a correctly translated second chunk.
ConvertInetMultibyteToUnicode() also returns the number of translated source
bytes, which is rather useful too.
2.) GNU library "iconv.dll" which is under LGPL.
It's around 800 KB and natively available on Linux and Mac OS. It works in a
similar way: iconv uses a context pointer to achieve the same.
Both require passing around either the Mode or the context. So, if we want
to fix the current charset bugs in ICS, some changes are required. It turns
out that a CodePage parameter alone is not enough to handle charset
conversion properly in all cases.
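
As a rough analogy for "passing around the Mode or the context", the sketch below uses Python's incremental decoder, which keeps the shift state inside a decoder object between calls. It is neither MLang nor iconv, just a stand-in showing why a code page number alone is not enough: something has to own the state from one chunk to the next.

```python
import codecs

# Analogy only: the decoder object plays the role of MLang's "Mode" value
# or iconv's context pointer. This is Python's own codec, not either API.
data = bytes([0x1B, 0x24, 0x42, 0x21, 0x41, 0x21, 0x41, 0x1B, 0x28, 0x42])

decoder = codecs.getincrementaldecoder("iso2022_jp")()  # owns the shift state
text = ""
for chunk in (data[:5], data[5:]):
    # The state left behind by the first chunk is applied to the second.
    text += decoder.decode(chunk)
text += decoder.decode(b"", final=True)  # flush and check for a cut-off tail

print(text == data.decode("iso2022_jp"))
```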
IMO it's time to move on to another design, most likely some custom
TEncoding class.
Or maybe you have another idea?
--
Arno Garrels
--
To unsubscribe or change your settings for TWSocket mailing list
please goto http://lists.elists.org/cgi-bin/mailman/listinfo/twsocket
Visit our website at http://www.overbyte.be