"Yu-Chung, Wang" wrote:
> After reading the source code of the Encoding class, I am somewhat
> confused.  We get an 'LString' and return a 'wstring'.  According to the
> documentation, LString is used to represent the document itself.  However,
> I suppose convertToUnicode should be used when we read a byte stream from
> a file or from keyboard input (or an input server of X), which is in a
> variable length encoding (such as UTF8, BIG5, SJIS, ...).  The type of
> the input string is LString instead of 'char *'.  Because LString could
> be one byte (char *) or wide character (wchar_t) depending on the compile
> time option, how should we put a variable length encoding in an LString?
> There are two possibilities.  First, we can put one byte of the input
> stream in a single LChar, no matter whether the character is one byte or
> two bytes.  Second, we can put one or two bytes of the input stream in a
> single LChar, according to the encoding.  My description is a little
> convoluted, so I'll explain with an example.  Suppose the byte stream is
>     0x40                         0xa1 0x40
>      first character           second character
> Suppose that we use wchar_t (16 bits).  With the first method, the
> LString is
> 
>     0040,00a1,0040  ===> 48 bits
> 
> and with the second method, it is
> 
>     0040,a140       ===> 32 bits
> 
> It seems that the second makes more sense, but it means that LString must
> know the encoding details.  This is not what we want, because encoding
> details should not be dealt with outside the Encoding class.
> 
> If we use the first method, it wastes memory when we use 16-bit
> characters.  Does it seem that we have no good solution?  Why?  My answer
> is that we shouldn't use LString here.  We should use a byte stream, i.e.
> 'char *'.  The dataflow inside LyX is

Since we only need to convert the variable width encoding when we get
input from the keyboard, the memory overhead of using the first option
is insignificant.  And the argument is similar when we read from a
file:  We read a line at a time, and convert that, so the memory
overhead is minimal.

>        read system call           convertToUnicode          LyX algorithms
> file ===============> char * ==================> LString ================> LString
> 
> The disadvantage of this approach is that we need to convert Unicode to
> the screen font encoding when we want to display.  It's a real
> disadvantage.  Therefore, another possibility is
> 
>        read system call           convertToLString          LyX algorithms
> file ===============> char * ==================> LString ================> LString
> 
> In this approach, we don't need Unicode anymore, and the LString could be
> sent to X function calls directly, because X supports fixed width
> encodings for a lot of one- and two-byte encodings.
> 
> Therefore, from my point of view, the design of Encoding seems not to
> fulfill the requirements.  I hope I'm wrong.

The pipeline is this:

file (or keyboard input) in char * in variable encoding

         |
Step 1  \|/
         v

LString in variable encoding

         |
Step 2  \|/
         v

LString in fixed width encoding

         |
Step 3  \|/
         v

LString in font encoding

         |
Step 4  \|/
         v

char * or wchar_t * for font renderer.

Let's detail the steps:

Step 1:  This step is done line by line (or key by key), so the memory
overhead of switching to 16-bit LString even if the file is 8-bit is
insignificant.

Step 2:  This step is done by the encoding classes, and converts the
potentially variable width encoding to a fixed width encoding that is
used as the document representation.
Optimally, this conversion will use Unicode as the middle layer: first
convert to Unicode, and then to the appropriate fixed width encoding.
Notice that it's entirely possible to skip the Unicode middle layer if
a more efficient encoding converter is written.  This step is not
performance critical.
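To make Step 2 concrete, here is a minimal sketch assuming a toy
Big5-like encoding where a lead byte >= 0xa1 starts a two-byte
character.  The names (convertToFixedWidth, toUnicode) are hypothetical
and not the actual LyX Encoding API, and the Unicode lookup is stubbed
out with a byte-packing stand-in:

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Fixed width 16-bit document string, standing in for a wide LString.
using FixedString = std::basic_string<char16_t>;

// Hypothetical middle-layer hook: a real converter would consult a
// Big5 -> Unicode mapping table here; packing the two bytes into one
// 16-bit code is enough to show the shape of the conversion.
inline char16_t toUnicode(unsigned char lead, unsigned char trail)
{
    return static_cast<char16_t>((lead << 8) | trail);
}

// Decode a variable width byte stream into a fixed width string:
// one LChar per character, regardless of how many bytes it took.
FixedString convertToFixedWidth(const unsigned char* bytes, std::size_t len)
{
    FixedString out;
    for (std::size_t i = 0; i < len; ++i) {
        if (bytes[i] >= 0xa1 && i + 1 < len) {
            // Two-byte character: consume the trail byte as well.
            out.push_back(toUnicode(bytes[i], bytes[i + 1]));
            ++i;
        } else {
            // One-byte character maps directly.
            out.push_back(static_cast<char16_t>(bytes[i]));
        }
    }
    return out;
}
```

On the byte stream from the original mail, 0x40 0xa1 0x40, this
produces the two characters 0040 and a140, i.e. the second packing
method discussed above.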

Step 3: This step is done at display time.  We need to convert from the
fixed width encoding used for the document representation to the
encoding the font renderer uses.  Notice that we chose the document
representation exactly the way we want in order to optimize this
process.
In particular, we do not need to go over the Unicode middle layer to
perform this conversion.  So in practice, this can be as efficient as
possible within the constraint that our document representation is
fixed width.
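A sketch of what skipping the Unicode middle layer in Step 3 might look
like: one table lookup per character, straight from the document
encoding to the font encoding.  The types and function names here are
illustrative, not actual LyX code:

```cpp
#include <string>
#include <unordered_map>

using DocChar  = char16_t;  // fixed width document character
using FontChar = char16_t;  // character in the font renderer's encoding

// Map one document character to the font encoding with a single
// lookup; characters the table does not remap pass through unchanged.
FontChar docToFont(DocChar c,
                   const std::unordered_map<DocChar, FontChar>& table)
{
    auto it = table.find(c);
    return it != table.end() ? it->second : static_cast<FontChar>(c);
}

// Because both encodings are fixed width, the output has exactly one
// character per input character: O(1) per character, O(n) per string,
// with no re-parsing of variable width sequences.
std::basic_string<FontChar>
convertForDisplay(const std::basic_string<DocChar>& doc,
                  const std::unordered_map<DocChar, FontChar>& table)
{
    std::basic_string<FontChar> out;
    out.reserve(doc.size());
    for (DocChar c : doc)
        out.push_back(docToFont(c, table));
    return out;
}
```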

Step 4: This step goes from the LString to the C string the font
renderer wants to use.  If we use an 8-bit font renderer, but a wide
character LString, we have a performance bottleneck here, because we
have to down-copy the text to an 8-bit char array.  This is the primary
reason why we choose to make LString compile time variable sized.
However, if our X server can handle a wide string, this step is
constant time.
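The down-copy bottleneck in Step 4 can be sketched as follows; this is
an illustrative helper, not LyX code.  If LChar is already char, no
such copy is needed and the string can be handed to the renderer
directly:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Narrow a wide string into an 8-bit, null-terminated buffer for an
// 8-bit renderer.  This linear-time copy happens on every draw call,
// which is exactly the cost the compile-time LChar choice avoids.
std::vector<char> narrowForRenderer(const std::wstring& s)
{
    std::vector<char> out;
    out.reserve(s.size() + 1);
    for (wchar_t c : s) {
        // Only safe if every character actually fits in 8 bits, i.e.
        // the document encoding matches the 8-bit font encoding.
        assert(c <= 0xff);
        out.push_back(static_cast<char>(c));
    }
    out.push_back('\0');  // C string for the X call
    return out;
}
```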

The key insight is that the encoding conversions do not *have* to use
Unicode as the middle layer.  If it is considered to be too slow, we
can omit this step with a hand-tuned converter.

The toUnicode and fromUnicode methods are not the primary encoding
conversion routines.  They are used internally in order to organize
things, and increase code reuse.  (And for the Unicode inset, as
described in the design document.)

In conclusion:  If we assume that the font renderer can handle both
8-bit and wide strings efficiently, the display process is optimal,
given the constraint that we have to use a fixed width document
encoding.  If the document representation matches the font encoding, we
have constant time display.

Alejandro, this is the primary reason for using LString rather than
vector as the chunk representation:  With a vector, we do not have
constant time display, because we have to copy to an array first:
linear time.
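The string-versus-vector argument can be sketched like this, with a
stand-in drawText for the real font renderer call (at the time this was
written, the C++ standard did not guarantee that vector storage was
contiguous, hence the copy):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Stand-in for the font renderer's draw call; it just reports how
// many characters it was asked to draw.
std::size_t drawText(const wchar_t* text, std::size_t len)
{
    (void)text;
    return len;
}

// A string chunk hands the renderer its buffer in constant time.
std::size_t drawString(const std::wstring& chunk)
{
    return drawText(chunk.c_str(), chunk.size());  // O(1) handoff
}

// A vector chunk must first be copied into an array: linear time on
// every draw, before the renderer even sees the text.
std::size_t drawVector(const std::vector<wchar_t>& chunk)
{
    std::wstring copy(chunk.begin(), chunk.end()); // O(n) copy first
    return drawText(copy.c_str(), copy.size());
}
```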

Greets,

Asger
