> >  In other words, the encoding of a document is a document level property,
> >  not a paragraph level property.
> 
> Agreed.  The question is "Are we going to allow \include{}-ing
> buffers in other encodings than the parent?"  LaTeX exports will
> be affected by this since the preamble writer must know what
> encodings to be used in the whole document.  (can use signal/slot)

If LaTeX can handle it, I think we should allow this.  It should not be too big
a problem to implement:  Do a descent of the children and collect a list of
their encodings.
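
Roughly like this, where Buffer is a hypothetical stand-in for the real
class:

    #include <cstddef>
    #include <set>
    #include <string>
    #include <vector>

    // Hypothetical stand-in for the real Buffer class.
    struct Buffer {
            std::string encoding;                 // document-level encoding
            std::vector<Buffer const *> children; // \include'd child documents
    };

    // Depth-first descent: collect the encodings of a buffer and all of
    // its \include'd children, so the preamble writer knows every
    // encoding that occurs in the document tree.
    void collectEncodings(Buffer const & buf, std::set<std::string> & encs)
    {
            encs.insert(buf.encoding);
            for (std::size_t i = 0; i < buf.children.size(); ++i)
                    collectEncodings(*buf.children[i], encs);
    }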

> > why we don't just use the wstring for document representation and just get 
> > rid of this strange LString altogether?
> > The answer is threefold: Performance in speed, performance in memory use,
> > and semantic clarity!
> 
> I'm not against your design but I would like a clarification.
> Could you elaborate a bit more on why the distinction of wstring
> and LString is necessary?

It is not necessary.  It's convenient, and I feel it is good design.  Let me
elaborate a bit more on why I think LString is a good idea:

Yes, we could probably get things working really well with just wstring.  

As explained, we have a performance advantage with LString at least when it
comes to Latin-family encodings, because we do not have to convert from a
16-bit encoding to an 8-bit encoding at display time.  (Excuse my ignorance
about Asian encodings.)
Notice that the main thing we want to optimize in LyX is the displaying.  This
is the only area in LyX where we have special data structures designed solely
for speeding things up, and with good reason:  If the displaying of text is
slow, LyX will seem sluggish no matter how fast other operations are.  So
optimizing the display speed is paramount.

Now you can argue that the conversion is only linear time if we use a static
buffer to avoid allocating memory, and the real drawing of the text will
probably take many orders of magnitude longer.  Hey, the X server might even
be doing encoding conversions behind our back, so why worry about one more
layer that can be heavily optimized?
This is all true, and therefore I agree that the speed argument is weak in
itself.
(One observation though:  Since we can choose our own encoding with LString, we
can actually *avoid* conversions behind our back:  You mention below that the X
server will convert many encodings to something else automatically.  Notice
that we can avoid this overhead by serving the X server the encoding it really
wants.  Then neither LyX nor the X server needs to convert any encodings at
display time.  With LString, we are more flexible to match the given X server. 
As an important special case, the LString can be encoded in Unicode.)
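
To make this concrete, here is one possible shape of LString.  This is purely
a sketch; the names and members are mine, not a settled design:

    #include <string>

    class Encoding;  // converts to/from Unicode; sketched further below

    // Sketch of LString: document text stored in a chosen encoding,
    // together with a tag identifying that encoding.  For Latin-family
    // documents the storage is 8 bits per character; the same class can
    // hold Unicode (e.g. UCS-2 bytes) if that is what the X server
    // wants, so no conversion is needed at display time.
    class LString {
    public:
            LString(std::string const & bytes, Encoding const * enc)
                    : bytes_(bytes), encoding_(enc) {}

            // Raw data in the native encoding, ready to hand to the
            // font machinery without any conversion.
            std::string const & bytes() const { return bytes_; }
            Encoding const * encoding() const { return encoding_; }

    private:
            std::string bytes_;         // text in the native encoding
            Encoding const * encoding_; // which encoding bytes_ is in
    };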

Similarly, the memory consumption argument is weak:  A factor of two in
memory savings in the document representation, and only for Latin-encoding
people, is not worth pursuing.  This is correct.  In itself, we can afford to
use double the memory for the raw document representation.  There are so many
other bloated data structures in LyX that the impact of this increase in
memory usage is manageable.  For instance, the undo feature is implemented by
*duplicating* all the affected paragraphs, and clearly, we could gain a lot of
memory by making a smarter undo.  Also, the Paragraph structure is severely
bloated with lots of duplicate information.  And seriously, the memory usage
of LyX has never been reported as a problem.
So, the memory argument is also weak on a technical scale.

Finally, the semantic clarity argument is inherently weak:  I'm talking about
something that we can probably never measure in a concrete way.  In short, I'm
claiming that just because we use the name LString for document representation
strings, and nothing else, we will have fewer bugs!
This can be hard to believe, but nevertheless I feel this is the strongest
argument:  Because the semantic intent of the string is visible in the name
itself, I hope the developers will be more aware of the issues involved with
encodings and write more correct code.
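
To illustrate (with completely hypothetical signatures):  Because LString is
a distinct type rather than a typedef, handing raw undecoded bytes to a
routine that expects document text becomes a compile error instead of a
latent encoding bug.

    #include <string>

    class Paragraph;   // stand-in declaration, for illustration only
    class LString;     // a distinct class, *not* a typedef for string

    // Document text in a known encoding -- only LString is accepted.
    void insertText(Paragraph & par, LString const & text);

    // Raw bytes in an unknown encoding, e.g. freshly read from disk.
    std::string readFileRaw(std::string const & filename);

    // insertText(par, readFileRaw("chap1.tex"));
    //   ^ compile error: the raw bytes must first be decoded into an
    //     LString with a known encoding.  With plain strings used
    //     everywhere, this bug would slip through silently.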

Summing up:  I have presented three arguments for using LString alongside
string and wstring.  Each of them is relatively weak in itself, but their sum
is strong, IMO.
Using LString gives us more flexibility performance-wise, and clearer
semantics.

> >  Then painting the text on the screen would be a very slow operation: Every
> >  time a word should be presented, it had to be converted to the encoding
> >  of the font we use.
> 
> This depends on what font loader/server we choose.

Yes, obviously.  Let me just say that the LString approach is a superset of
the wstring approach:  LString can be encoded in Unicode, if this is
advantageous for the X server.
If we only had Unicode encoding, all Latin families would require a conversion
that is otherwise unnecessary.

> More important may be the speed of character level operations you
> mentioned earlier:
> 
> > As the medium to perform generic character level operations.
> 
> So, unless explicitly optimized methods are implemented for the
> encoding you happen to be using, to/from conversion between
> Unicode is performed every time character level operations are
> performed.  And you do not want to implement \( n \) sets of
> methods for all encodings.  hmm...

Notice that these character level operations are NOT speed critical.  We hardly
ever convert a lower case letter to an upper case one, and when we do, it's
only small quantities of text.  So in practically all cases, there is no
problem with having to go over the Unicode medium.
The only character level operation that might be speed critical is the
composition operation:  To learn what components a given glyph is composed of,
in the case that we can't display it directly.  Then we want to emulate it, as
we do with the LaTeX Accent insets we have now.  In practice, we'd ask the
Unicode database what a given glyph is composed of, and if possible draw all
those components on top of each other to emulate the glyph.  I expect this to
be a rare situation.
So I don't anticipate that we will need to optimize these character
conversions for any encodings, but it's nice to know that it is an option if
it turns out to be needed for some strange encoding.  For instance, I'm not
certain that glyph composition isn't much more common in Asian encodings.  In
that case, the corresponding encodings could implement the composition
operation efficiently, if it turned out to be a problem.
The key is that we can optimize *selectively*.  If we find a bottleneck, we can
attack that bottleneck very precisely.  I hope it turns out that there is no
bottleneck going over Unicode, and then we will have obtained the claimed
saving in code size, and simplicity.
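
In code, the idea might look like this.  The Encoding interface is
hypothetical:  Every encoding supplies just the two conversions to and from
Unicode, the generic character-level operations are implemented once over
that medium, and an encoding that turns out to be a bottleneck can override
exactly the operation that hurts.

    #include <cstddef>
    #include <string>
    #include <vector>

    typedef unsigned short ucs2_char;  // assumes a 16-bit Unicode medium

    class Encoding {
    public:
            virtual ~Encoding() {}

            // The only operations each encoding *must* implement.
            virtual std::vector<ucs2_char>
            toUnicode(std::string const & s) const = 0;
            virtual std::string
            fromUnicode(std::vector<ucs2_char> const & u) const = 0;

            // Generic default: go over the Unicode medium.  Not speed
            // critical, so the double conversion is acceptable.  An
            // encoding where this *is* a bottleneck overrides it with
            // a direct implementation -- selective optimization.
            virtual std::string uppercase(std::string const & s) const
            {
                    std::vector<ucs2_char> u = toUnicode(s);
                    for (std::size_t i = 0; i < u.size(); ++i)
                            u[i] = unicodeUppercase(u[i]);
                    return fromUnicode(u);
            }

    private:
            static ucs2_char unicodeUppercase(ucs2_char c)
            {
                    // Sketch only: real code would consult the Unicode
                    // database; this handles just ASCII.
                    return (c >= 'a' && c <= 'z') ? c - ('a' - 'A') : c;
            }
    };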

> > The second reason we have LString involves memory consumption.
> 
> What is the rationale for the fixed width encoding here?  Of course
> for a variable width encodings, the design of string class must be
> something different from the standard string, probably something like
> linked list type one.  But considering that accented characters are
> less used than non-accented characters, 8-bit people will not experience
> the double memory bloat if UTF8 rather than UTF16 is used.  Wait a minute,
> I realized that we 16-bit users will experience the 3/2 times memory
> bloat if UTF8 is adopted.

You mention the key yourself.
Let me just repeat it:  For Latin-script people, 8 bits are enough, so UTF-8
would be slightly more costly.  For 16-bit people, UTF-8 is clearly more
expensive (the 3/2 factor you mention).
Variable-width encodings, then, are not memory conserving, and the trouble
they involve is not worth it.
Yes, we can always emulate a fixed-width encoding with a clever string class
and provide random access, but the performance would be very bad.
So variable-width encodings are only interesting as a storage medium, and of
course LyX will support that.
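
To see why random access is the sticking point, consider UTF-8.  A minimal
sketch:  Finding the byte offset of the i-th *character* requires scanning
from the start of the string, so indexing is linear time where a fixed-width
encoding gives constant time.

    #include <cstddef>
    #include <string>

    // Byte offset of the i-th character in a UTF-8 string.  The lead
    // byte of each character tells us how many bytes to skip.
    std::size_t utf8ByteOffset(std::string const & s, std::size_t i)
    {
            std::size_t pos = 0;
            while (i > 0 && pos < s.size()) {
                    unsigned char c = s[pos];
                    if (c < 0x80)            pos += 1;  // 0xxxxxxx: ASCII
                    else if ((c >> 5) == 6)  pos += 2;  // 110xxxxx
                    else if ((c >> 4) == 14) pos += 3;  // 1110xxxx
                    else                     pos += 4;  // 11110xxx
                    --i;
            }
            return pos;  // O(n), where fixed width gives s[i] in O(1)
    }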

> > The last, and more subtle, reason to introduce LString is semantic clarity.
> 
> This I think is a good point.

Thank you.

> I like this and the subsequent subsection.  It will be really a headache
> to support a state dependent variable length encoding as a medium of
> storage or manipulations.

So, in conclusion, I think you mostly agree that the design is sound, and
complete enough to support the needs of the Asian encodings?

> > The information needed for this is available in the Unicode glyph database
> > which can be found somewhere on the net; try www.unicode.org.
> 
> The database is supplied as text files suitable for processing with
> scripting languages.  We can wait until Friday and designate Amir Karger as
> the coder.

Good idea.
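
For reference, UnicodeData.txt is a semicolon-separated table with the
decomposition mapping in the sixth field.  A quick sketch of pulling those
mappings out, in C++ here although a scripting language is the natural tool,
as you say:

    #include <fstream>
    #include <iostream>
    #include <sstream>
    #include <string>

    // Print each code point with its canonical decomposition, e.g.
    // "00E9 -> 0065 0301" (e-acute = e + combining acute).  Error
    // handling omitted.
    int main()
    {
            std::ifstream in("UnicodeData.txt");
            std::string line;
            while (std::getline(in, line)) {
                    std::istringstream fields(line);
                    std::string code, skip, decomp;
                    std::getline(fields, code, ';');    // field 1: code point
                    for (int i = 0; i < 4; ++i)         // skip fields 2-5
                            std::getline(fields, skip, ';');
                    std::getline(fields, decomp, ';');  // field 6: decomposition
                    // Compatibility mappings are tagged "<...>"; keep
                    // only the canonical ones.
                    if (!decomp.empty() && decomp[0] != '<')
                            std::cout << code << " -> " << decomp << '\n';
            }
            return 0;
    }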

Greets,

Asger
