[How to convert to and from Unicode]
> For the iso8859-x family, there are only a few glyphs in each encoding.
> Therefore, it's trivial.
> But for an Asian language encoding, the stupid decision was made long ago to
> spread one language over the whole encoding space. Therefore, the conversion
> is not trivial. This is why I say we need much more memory space than for the
> iso8859 series.

Notice that the situation is the same for the iso-8859-x encodings:  The
Unicode glyphs corresponding to the upper iso positions are spread out over a
large encoding space.  So the current implementation has two look-up maps:

One that maps from iso-8859-x to Unicode glyphs.  Since we know that the
interesting glyphs occupy 0xA0-0xFF in the iso map, we can settle for one
table, conceptually indexed from 0xA0-0xFF, that holds the corresponding
Unicode glyphs.  You will have to adopt a more refined approach.
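
To make the simple iso-8859-x case concrete, here is a minimal sketch of such a
table and lookup.  The array name and the few sample entries are invented for
illustration; a real table would be generated from the official mapping files:

// Sketch only: Unicode values for the upper iso-8859-x codes 0xA0-0xFF.
// Index 0 corresponds to 0xA0, index 0x5F to 0xFF.  The three entries shown
// are placeholders; the remaining 93 would follow the same pattern.
unsigned short const iso_to_unicode[0x60] = {
    0x00A0, 0x0104, 0x02D8 /* , ... 93 more entries ... */
};

// Look up the Unicode glyph for an iso-8859-x code.
unsigned short isoToUnicode(unsigned char c)
{
    if (c < 0xA0)
        return c;  // 0x00-0x9F map identically to Unicode
    return iso_to_unicode[c - 0xA0];
}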

Then we have another map that maps a Unicode glyph to an iso-8859-x glyph.
Since we know that the codes 0x0020-0x007F are mapped identically in
iso-8859-x, we can just copy those.  The rest of the Unicode glyphs are spread
out over a large area, so we have to build a real map:  First, we list all the
relevant Unicode glyphs in a table, and put the corresponding iso-glyph next to
each Unicode glyph.  Then we sort this table according to the Unicode glyph,
and separate the Unicode glyphs and the iso-8859-x glyphs into two parallel
tables.

Now, we can use binary search to look up the iso-8859-x glyph for any Unicode
glyph.
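
A rough sketch of that reverse lookup, with two parallel tables sorted by
Unicode value.  Again, the names are mine and the entries are placeholders; the
substitution character '?' is just my own choice:

// Sketch only: parallel tables, sorted by Unicode value.  A real table would
// list every glyph of the encoding outside the identical 0x20-0x7F range.
unsigned short const unicode_keys[] = { 0x00A0, 0x0104, 0x02D8 /* ... */ };
unsigned char  const iso_codes[]    = { 0xA0,   0xA1,   0xA2   /* ... */ };
int const table_size = sizeof(unicode_keys) / sizeof(unicode_keys[0]);

// Binary search for the iso-8859-x code of a Unicode glyph.
// Returns '?' if the glyph cannot be represented in this encoding.
unsigned char unicodeToIso(unsigned short uc)
{
    if (uc >= 0x0020 && uc <= 0x007F)
        return static_cast<unsigned char>(uc);  // identical range, just copy
    int lo = 0;
    int hi = table_size - 1;
    while (lo <= hi) {
        int const mid = (lo + hi) / 2;
        if (unicode_keys[mid] == uc)
            return iso_codes[mid];
        else if (unicode_keys[mid] < uc)
            lo = mid + 1;
        else
            hi = mid - 1;
    }
    return '?';  // no mapping in this encoding
}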

Granted, for Asian encodings, the number of glyphs is much larger, but
conceptually it's almost the same situation.  The main difference is that the
first conversion, from the Asian encoding to Unicode, requires two tables,
because the encoding space of the Asian encoding is probably not continuous
like in the iso-8859-x case.
So you might need two real maps, instead of just one.  Other than that, there
is no difference, and I don't see why this should not be possible to implement.
Binary search is efficient enough for the purposes we pursue:  With 50,000
glyphs, it takes about 16 comparisons to look up a glyph in the map.
Assuming a non-continuous Asian encoding space, the memory consumption in bytes
is eight times the number of glyphs in the encoding.
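For instance, with the 50,000 glyphs mentioned above, and assuming the eight
bytes come from two-byte entries in four tables (encoding code and Unicode
value for each direction), that is 50,000 * 8 = 400,000 bytes, or roughly
400 KB for such an encoding.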

> Maybe we can use dynamic loading, as XFree86 does, to overcome this problem.
> Because it's possible that we will have a lot of different encodings in the
> future.
> Usually, people need only a few of them.

I'd prefer that we wait with this.  If these encoding converters turn out to be
a problem, we will address it then.  For now, let's keep things simple.

> I'll try to make an encoding class for BIG5, definitely. In fact, most of my
> questions come from the definition of the encoding class. I don't think the
> current definition of encoding is enough. Let me invent a possible usage of
> the encoding class here.

[Nice summary of the way to use the encoding converters.]

> (6) When we need to save the buffer, we need to convert from the internal
> encoding to the file encoding.
>
> (7) But there's a problem in the above code. If the file encoding is an 8-bit
> encoding and we use the 16-bit version of LString, how can we save this
> string?

You are right that we have to handle this explicitly.  In particular, I propose
that we provide four fixed conversion routines in StringTools.h:

wstring toWString(LString);
string toString(LString);
LString toLString(wstring);
LString toLString(string);

Depending on the compile-time option, these methods will be either constant
time or linear time.  Also, in the case of conversion from a wide encoding to a
narrow one, we will definitely lose information, but that's just too bad.
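
To make the intent concrete, here is a rough sketch of the two directions,
assuming LString is compiled as the wide string; the '?' replacement character
is only my own placeholder, not a decision:

#include <string>

using std::string;
using std::wstring;

// Sketch only: narrowing conversion from the wide string to an 8-bit one.
// Glyphs above 0xFF cannot be represented, so this is where information
// is lost.
string toString(wstring const & ws)
{
    string s;
    s.reserve(ws.size());
    for (wstring::size_type i = 0; i < ws.size(); ++i) {
        unsigned long const wc = static_cast<unsigned long>(ws[i]);
        s += (wc <= 0xFF) ? static_cast<char>(wc) : '?';
    }
    return s;
}

// The widening direction loses nothing.
wstring toWString(string const & s)
{
    wstring ws;
    ws.reserve(s.size());
    for (string::size_type i = 0; i < s.size(); ++i)
        ws += static_cast<wchar_t>(static_cast<unsigned char>(s[i]));
    return ws;
}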

Now, the save routine will be able to choose the right format to write the file
in.  (I will add a boolean flag to the encoding database that signifies whether
an encoding is wide or not.)
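
A rough sketch of how the save routine could branch on that flag; the
EncodingInfo stand-in and its member name are hypothetical, and the actual
encoding converter calls are omitted:

#include <ostream>
#include <string>

// Hypothetical stand-in for an entry in the encoding database; only the
// proposed "wide" flag matters for this sketch.
struct EncodingInfo {
    bool wide;
};

// toString() as proposed for StringTools.h (sketched above).
std::string toString(std::wstring const &);

// Sketch: pick the on-disk representation from the flag.
void saveString(std::ostream & os, std::wstring const & text,
                EncodingInfo const & enc)
{
    if (enc.wide) {
        // wide file encoding: hand the 16-bit glyphs to the wide converter
    } else {
        // 8-bit file encoding: narrow first, then run the 8-bit converter
        std::string const s = toString(text);
        os << s;  // placeholder for the real converter + write
    }
}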

Greets,

Asger
