At 11:06 AM 3/27/2001 -0800, Larry Wall wrote:
>Dan Sugalski writes:
>: At 07:21 AM 3/27/2001 -0800, Larry Wall wrote:
>: >Dan Sugalski writes:
>: >Assume that in practice most of the normalization will be done by the
>: >input disciplines.  Then we might have a pragma that says to try to
>: >enforce level 1, level 2, level 3 if your data doesn't match your
>: >expectations.  Then hopefully the expected semantics of the operators
>: >will usually (I almost said "normally" :-) match the form of the data
>: >coming in, and forced conversions will be rare.
>:
>: The only problem with that is that it means we'll potentially be
>: altering the data as it comes in, which leads back to the problem of
>: input and output files not matching for simple filter programs. (Plus
>: it means we spend CPU cycles altering data that we might not actually
>: need to.)
>
>I think the programmer will often know which files are already
>normalized, and can just be slurped in with a :raw discipline or some
>such.  Whether that can be a default filter policy is really a matter
>that depends on how the OS handles things.

I'm not sure "raw" is the right word, given that the data is really
Unicode. It's not raw in the sense that a JPEG image or an executable is
raw data.

>It's almost more important to know what form the programmer wants for
>the output.  I don't think, by and large, that people will be interested
>in producing files that are of mixed normalization.  On input you take
>what you're given, but on output, you have to make a policy decision.
>That's likelier to be consistent for a whole program (or at least a whole
>lexical scope), so we can probably have a declaration for the preferred
>output form, and default everything that way.

That's fine. I'd sort of assumed that the default encoding would be used 
for this, though I don't think I actually said that.
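
Something like this is the shape I'm picturing for the output side, as a
C sketch only: the form names and the to_form() converter are made-up
stand-ins, and the stub just passes bytes through where a real converter
would transcode them.

    #include <stdio.h>
    #include <stddef.h>

    typedef enum { NORM_NONE, NORM_NFC, NORM_NFD } norm_form;

    typedef struct out_stream {
        FILE      *fp;
        norm_form  form;   /* the declared preferred output form */
    } out_stream;

    /* Stand-in converter: a real one would transcode buf into `form`;
       this one passes the bytes through unchanged. */
    static const char *to_form(const char *buf, size_t len,
                               norm_form form, size_t *out_len)
    {
        (void)form;
        *out_len = len;
        return buf;
    }

    /* Every write funnels through the stream's policy, so a whole
       program (or lexical scope) gets one consistent form on the way
       out. */
    size_t out_write(out_stream *os, const char *buf, size_t len)
    {
        size_t n;
        const char *converted = to_form(buf, len, os->form, &n);
        return fwrite(converted, 1, n, os->fp);
    }

The point is just that the policy hangs off the stream, gets set once
per scope or program, and every write goes through it.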

>This is somewhat orthogonal to the issue of laziness, however.  To the
>first approximation it doesn't really matter when we do the conversion, as
>long as the user sees a consistent semantics.  (Real life intrudes in
>the case of when exceptions are thrown, however.)

True, but we kinda can't dodge the issue here, as we're the folks that get 
to implement the laziness... :)
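
For concreteness, here's roughly what I mean by lazy conversion, as a C
sketch rather than a committed design; unicode_normalize() here is a
pass-through stub standing in for a real normalizer.

    #include <stdlib.h>
    #include <string.h>

    typedef struct lazy_str {
        char   *orig;       /* bytes exactly as they were read in */
        size_t  orig_len;
        char   *norm;       /* normalized copy, built only on demand */
        size_t  norm_len;
    } lazy_str;

    /* Stand-in normalizer: a real one would apply NFC/NFD; a plain
       copy is just enough to show where the laziness lives. */
    static char *unicode_normalize(const char *src, size_t len,
                                   size_t *out_len)
    {
        char *dst = malloc(len + 1);
        memcpy(dst, src, len);
        dst[len] = '\0';
        *out_len = len;
        return dst;
    }

    /* Operators that need normalized data call this; the conversion
       cost is paid once, on the first use. */
    const char *lazy_str_normalized(lazy_str *s, size_t *len)
    {
        if (s->norm == NULL)
            s->norm = unicode_normalize(s->orig, s->orig_len,
                                        &s->norm_len);
        if (len)
            *len = s->norm_len;
        return s->norm;
    }

    /* Output-as-input mode: the original bytes survive untouched, so
       a simple filter's output still matches its input. */
    const char *lazy_str_original(const lazy_str *s, size_t *len)
    {
        if (len)
            *len = s->orig_len;
        return s->orig;
    }

The conversion cost gets paid on the first access that actually needs
the normalized form, and never at all for data that just flows straight
through, which is the filter-program case above.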

>That being said, I don't think we can easily predict how many passes
>we're going to make over the data, but if we're going to be making many
>passes over the same data, it's more efficient to convert once at the
>beginning than to emulate one character set in another each time.
>Emulation has the advantage of keeping the old representation around as
>long as possible, however.

I'm half-tempted to implement a 'touch count' in the scalars somewhere
to track the number of times something's been dealt with in a non-native
way, and use that as an indicator of whether we should just up and
convert things. I can't shake the feeling that it'll be more expensive
than not doing it, though.
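
For what it's worth, the cheap version of that would look something like
this in C; convert_in_place() is a stand-in for the real converter, and
the threshold is a number pulled out of the air rather than anything
measured.

    #include <stdbool.h>

    /* Arbitrary; the right number would have to come from profiling. */
    #define CONVERT_THRESHOLD 3

    typedef struct scalar {
        void     *data;         /* payload, in whatever form it arrived */
        bool      native;       /* already in the preferred form? */
        unsigned  touch_count;  /* non-native accesses seen so far */
    } scalar;

    /* Stand-in converter: a real one would transcode sv->data into
       the preferred representation in place. */
    static void convert_in_place(scalar *sv)
    {
        sv->native = true;
    }

    /* Called on each access that wants the preferred representation:
       cheap emulation below the threshold, a one-time conversion once
       we cross it. */
    void touch_non_native(scalar *sv)
    {
        if (sv->native)
            return;
        if (++sv->touch_count >= CONVERT_THRESHOLD)
            convert_in_place(sv);
    }

The per-access overhead is just the branch and the increment, which is
exactly the cost I'm worried about being more expensive than it's worth.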

                                        Dan

--------------------------------------"it's like this"-------------------
Dan Sugalski                          even samurai
[EMAIL PROTECTED]                         have teddy bears and even
                                      teddy bears get drunk
