Dan Sugalski writes:
: At 07:21 AM 3/27/2001 -0800, Larry Wall wrote:
: >Dan Sugalski writes:
: >Assume that in practice most of the normalization will be done by the
: >input disciplines.  Then we might have a pragma that says to try to
: >enforce level 1, level 2, level 3 if your data doesn't match your
: >expectations.  Then hopefully the expected semantics of the operators
: >will usually (I almost said "normally" :-) match the form of the data
: >coming in, and forced conversions will be rare.
: 
: The only problem with that is it means we'll be potentially altering the 
: data as it comes in, which leads back to the problem of input and output 
: files not matching for simple filter programs. (Plus it means we spend CPU 
: cycles altering data that we might not actually need to.)

I think the programmer will often know which files are already
normalized, and can just be slurped in with a :raw discipline or some
such.  Whether that can be a default filter policy is really a matter
that depends on how the OS handles things.
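
To make that concrete in present-day Perl 5 terms (a rough sketch only;
the layer names and the choice of NFC are illustrative, not something
the design above commits to):

    use Unicode::Normalize qw(NFC);

    # Data the programmer already trusts: slurp the bytes untouched.
    open my $raw, '<:raw', 'already_normalized.txt' or die $!;

    # Data of unknown provenance: decode at the boundary and force a
    # canonical form before the program ever looks at it.
    open my $in, '<:encoding(UTF-8)', 'unknown.txt' or die $!;
    while (my $line = <$in>) {
        my $canonical = NFC($line);
        # ... work with $canonical ...
    }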

It's almost more important to know what form the programmer wants for
the output.  I don't think, by and large, that people will be interested
in producing files that are of mixed normalization.  On input you take
what you're given, but on output, you have to make a policy decision.
That's likelier to be consistent for a whole program (or at least a whole
lexical scope), so we can probably have a declaration for the preferred
output form, and default everything that way.
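
In Perl 5 terms again (purely an illustration; the declaration being
mooted here doesn't exist), "one output form for the whole program"
amounts to funnelling everything written through a single normalizer:

    use Unicode::Normalize qw(NFC);

    open my $out, '>:encoding(UTF-8)', 'output.txt' or die $!;

    # One policy for the whole program: everything goes out in NFC.
    # The helper name put() is invented for the sketch.
    sub put {
        my ($fh, @text) = @_;
        print {$fh} map { NFC($_) } @text;
    }

    put($out, "re\x{0301}sume\x{0301}\n");   # decomposed in, composed out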

This is somewhat orthogonal to the issue of laziness, however.  To the
first approximation it doesn't really matter when we do the conversion, as
long as the user sees a consistent semantics.  (Real life intrudes when
exceptions are thrown, however.)
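
For instance, with eager conversion a malformed sequence dies right at
the I/O boundary, while a lazy scheme would defer the same failure to
whenever the string is first examined.  A Perl 5 sketch of the eager
half (Encode's FB_CROAK is real; the lazy timing is the hypothetical
part):

    use Encode qw(decode);

    my $bytes = "\xC3\x28";    # malformed UTF-8
    my $chars = eval { decode('UTF-8', $bytes, Encode::FB_CROAK) };
    warn "caught at input time: $@" if $@;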

: It might turn out that deferred conversions don't save anything, and if 
: that's so then I can live with that. And we may feel comfortable declaring 
: that we preserve equivalency in Unicode data only, and that's OK too. 
: (Though *you* get to call that one... :)

I think we can have our cake and eat it too if we are very careful to
distinguish semantics from representation.  In the extreme view, you
can have an EBCDIC representation and make it look to the program like
you're processing Unicode, as long as you don't go outside the subset
that corresponds to EBCDIC.  It's just a small matter of programming.  :-)
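
A toy Perl 5 rendering of that separation (the class, its API, and the
cp1047 code page are all mine for the example): keep the legacy bytes
as the representation and decode only when someone asks a
character-level question.

    package LegacyString;
    use Encode ();

    sub new {
        my ($class, $bytes) = @_;
        return bless { bytes => $bytes, chars => undef }, $class;
    }

    # Decode lazily, and only once; callers see Unicode semantics while
    # the EBCDIC representation stays available untouched.
    sub chars {
        my ($self) = @_;
        $self->{chars} = Encode::decode('cp1047', $self->{bytes})
            unless defined $self->{chars};
        return $self->{chars};
    }

    sub bytes      { $_[0]->{bytes} }          # the raw representation, for output
    sub char_count { length( $_[0]->chars ) }
    sub upper      { uc(     $_[0]->chars ) }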

That being said, I don't think we can easily predict how many passes
we're going to make over the data, but if we're going to be making many
passes over the same data, it's more efficient to convert once at the
beginning than to emulate one character set in another each time.  Emulation
has the advantage of keeping the old representation around as long as
possible, however.
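
Continuing the sketch above: a pure pass-through filter never pays for a
conversion at all and writes the original bytes back out, while a
program that makes many character-level passes pays for one decode
instead of re-emulating EBCDIC on every pass.

    my $ebcdic = "\xC8\x85\x93\x93\x96";   # "Hello" in cp1047
    my $s = LegacyString->new($ebcdic);

    print $s->bytes;                       # filter case: bytes in, same bytes out

    for my $pass (1 .. 3) {
        printf "pass %d: %d chars\n", $pass, $s->char_count;   # decodes once, on the first pass
    }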

Larry
