[How to convert to and from Unicode]

> For the iso8859-x family, there are only a few glyphs in each encoding.
> Therefore, it's trivial.
> But for an Asian language encoding, the stupid decision was made to spread
> one language over the whole encoding space. Therefore, the conversion is
> not trivial. This is why I say we need much more memory space than for the
> iso8859 series.

Notice that the situation is the same for the iso-8859-x encodings: the upper
glyphs map to Unicode glyphs that are spread out over a large encoding space.

So, the current implementation has two look-up maps. The first maps from an
iso-8859-x glyph to a Unicode glyph. Since we know that the interesting glyphs
occupy 0xA0-0xFF in the iso encoding, we can settle for one table, conceptually
indexed from 0xA0 to 0xFF, that holds the corresponding Unicode glyphs. For an
Asian encoding you will have to adopt a more refined approach.

Then we have another map that maps a Unicode glyph to an iso-8859-x glyph.
Since we know that the codes 0x0020-0x007F are mapped identically in
iso-8859-x, we can just copy those. The rest of the Unicode glyphs are spread
out over a large area, so we have to build a real map: First, we list all the
relevant Unicode glyphs in a table, and put the corresponding iso glyph next
to each Unicode glyph. Then we sort this table according to the Unicode glyph,
and separate the Unicode glyphs and the iso-8859-x glyphs into two parallel
tables. Now we can use binary search to look up the iso-8859-x glyph for any
Unicode glyph (a rough sketch of this follows further down).

Granted, for Asian encodings the number of glyphs is much larger, but
conceptually it's almost the same situation. The main difference is that the
first conversion, from the Asian encoding to Unicode, also requires two
tables, because the encoding space of the Asian encoding is probably not
continuous like in the iso-8859-x case. So you might need two real maps
instead of just one. Other than that, there is no difference, and I don't see
why this should not be possible to implement.

Binary search is efficient enough for the purposes we pursue: with 50,000
glyphs, it takes about 16 comparisons to look up a glyph in the map. Assuming
a non-continuous Asian encoding space, the memory consumption is eight bytes
per glyph in the encoding: four tables (two per direction), each with one
two-byte entry per glyph.

> Maybe we can use dynamic loading, as XFree86 does, to overcome this problem,
> because it's possible that we will have a lot of different encodings in the
> future. Usually, people need only a few of them.

I'd prefer that we wait with this. If these encoding converters turn out to be
a problem, we will address it then. For now, let's keep things simple.

> I'll definitely try to make an encoding class for BIG5. In fact, most of my
> questions come from the definition of the encoding class. I don't think the
> current definition of encoding is enough. Let me invent a possible usage of
> the encoding class here.

[Nice summary of the way to use the encoding converters.]

> (6) When we need to save the buffer, we need to convert from the internal
> encoding to the file encoding.
>
> (7) But there's a problem in the above code. If the file encoding is an
> 8-bit encoding and we use the 16-bit version of LString, how could we save
> this string?

You are right that we have to handle this explicitly. In particular, I propose
that we provide four fixed conversion routines in StringTools.h:

    wstring toWString(LString);
    string  toString(LString);
    LString toLString(wstring);
    LString toLString(string);

Depending on the compile-time option, these methods will either be constant
time or linear time.
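To make the intent a bit more concrete, here is a rough sketch of the two
routines going out of LString, under the assumption that LString is a typedef
for either string or wstring selected at compile time. The macro name
LYX_WIDE_STRINGS is made up for illustration, and replacing unrepresentable
glyphs with '?' is only my assumption about a reasonable policy. The two
toLString() routines would be symmetrical.

    #include <string>
    using std::string;
    using std::wstring;

    // Illustration only: the macro name is invented.
    #ifdef LYX_WIDE_STRINGS
    typedef wstring LString;
    #else
    typedef string LString;
    #endif

    wstring toWString(LString s)
    {
    #ifdef LYX_WIDE_STRINGS
            return s;                        // already wide; nothing to do
    #else
            wstring res;
            res.reserve(s.size());
            for (string::size_type i = 0; i < s.size(); ++i)
                    // go through unsigned char so that 0xA0-0xFF survive
                    res += static_cast<wchar_t>(static_cast<unsigned char>(s[i]));
            return res;                      // linear time
    #endif
    }

    string toString(LString s)
    {
    #ifdef LYX_WIDE_STRINGS
            // Narrowing: glyphs above 0xFF cannot be kept, so they are
            // replaced by '?' here (just an assumed policy).
            string res;
            res.reserve(s.size());
            for (wstring::size_type i = 0; i < s.size(); ++i)
                    res += s[i] <= 0xFF ? static_cast<char>(s[i]) : '?';
            return res;                      // linear time
    #else
            return s;                        // already narrow; nothing to do
    #endif
    }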
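And here is the rough sketch, promised above, of the table-based Unicode to
iso-8859-x lookup with binary search. The names and the three sample entries
(taken from iso-8859-15) are invented purely for illustration; a real
converter would of course carry the full table for the encoding in question.

    // Unicode glyphs, sorted ascending, and the iso glyphs they map to,
    // kept in two parallel tables.
    static const unsigned short uni_keys[]   = { 0x00A3, 0x0153, 0x20AC };
    static const unsigned char  iso_values[] = { 0xA3,   0xBD,   0xA4   };
    static const int map_size = sizeof(uni_keys) / sizeof(uni_keys[0]);

    // Return the iso-8859-x glyph for a Unicode glyph, or '?' if the glyph
    // has no representation in this encoding.
    unsigned char uniToIso(unsigned short uc)
    {
            if (uc >= 0x0020 && uc <= 0x007F)
                    return static_cast<unsigned char>(uc);   // identical range

            int lo = 0;
            int hi = map_size - 1;
            while (lo <= hi) {                               // binary search
                    int const mid = (lo + hi) / 2;
                    if (uni_keys[mid] == uc)
                            return iso_values[mid];
                    if (uni_keys[mid] < uc)
                            lo = mid + 1;
                    else
                            hi = mid - 1;
            }
            return '?';
    }

The other direction, iso-8859-x to Unicode, is just the 96-entry array indexed
by (glyph - 0xA0) described above; for an Asian encoding both directions would
use the two-table form.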
Also, in the case of conversion from a wide encoding to a narrow one, we will
definitely lose information, but that's just too bad.

Now, the save routine will be able to choose the right format to write the
file in. (I will add a boolean flag to the encoding database that signifies
whether an encoding is wide or not.)

Greets,

Asger