On Fri, 18 Aug 2000, Simon Cozens <[EMAIL PROTECTED]> wrote: 
> RFC 131

{snipped}

Having pondered this (unsuccessfully) for a couple of weeks, I will
certainly concede that this is the simplest way of handling the
problem.

A couple of counter-points.

1. I believe your extrapolation from 'one string needs conversion' to
'every string will eventually be converted anyway' simply isn't true,
at least not all the time.  (Which, of course, is what makes this
problem so difficult.  I can't really say whether it's true or false
most of the time; I suspect the percentages will depend on the
application.)  Large-volume textual scanning, like email, for
instance, where only a small portion of the data ever needs to be in
the internal format, could perform a multitude of needless, pointless
conversions.
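
A minimal sketch of that scanning case (the spool path and header
pattern are just illustrative): every line below matches byte-for-byte
as plain ASCII, yet an up-front conversion policy would widen each one
into the internal format first, for no benefit.

    # Scan a mail spool for sender headers; no re-encoding needed.
    open(MAIL, '<', '/var/spool/mail/nobody') or die "open: $!";
    while (<MAIL>) {
        print "message from $1\n" if /^From:\s*(.+)/;
    }
    close(MAIL);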

2. It will force Perl to be built for the largest data set expected.
If I expect to handle some UTF32 traffic, my Perl will be built with
UTF32 as the default scheme.  This is overkill for the 95% of the time
my scripts run in 100% ASCII mode.  If Perl is instead built so the
base scheme is pluggable at run-time (which would break the "single
conversion" scheme, since you could conceivably change the encoding
mid-run), you are now, in essence, doing what you didn't want to do in
the first place.  In both cases, you are forcing the programmer to
accurately predict the data set to be expected, or to suffer the bloat
of the worst-case scenario, when that scenario may rarely happen.
That bloat (for those playing along at home) doesn't just extend to
the text you are reading in and writing out, but to perl's core
structures themselves.  All the symbols, for instance, will also need
to be converted to the default encoding scheme.  That can be quite hefty.
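
Some back-of-the-envelope math (the symbol count and identifier length
are invented numbers, purely for illustration):

    # Cost of widening every identifier in the symbol table to UTF32.
    my ($symbols, $avg_len) = (2_000, 12);    # assumed, for the sake of argument
    printf "symbol table: %d bytes as ASCII, %d bytes as UTF32\n",
        $symbols * $avg_len,        # 1 byte per character:  24,000
        $symbols * $avg_len * 4;    # 4 bytes per character: 96,000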

3. In addition to the run-time slow-down of a total conversion (and,
granted, flag-checking with string promotion isn't free, either),
you're now also slowing down compile time, as the script itself must
first be converted to the core's encoding scheme.

4. If you guess wrong, you can't handle anything wider than you built
for.  Build for UTF8, and choke on UTF16.  Build for 16, choke on 32.
Build for 32, and cringe as everything turns out to be plain ASCII.
Build one of each, and then have to guess which one to use.

5. You're now making it very difficult to handle binary data.  If I
open a file with a binary line discipline and slurp it into a scalar,
Perl should attempt to convert that to the internal text encoding,
correct?  As I pass that scalar around, how does Perl keep track of
the fact that the data is binary, not text?  I may still do "textual"
things to it, like scan it for patterns or print out a hex dump.  Now,
you're either going to convert it, or you'll need some way to flag
that it shouldn't be converted.  (Or you'll need to create another
type of scalar, with all the regular scalar functions recreated for
the binary type.)
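
To make that concrete, a small sketch (the filename is illustrative):
if the core converted $data to its internal text encoding on read,
both of the "textual" operations below would be working on mangled
bytes.

    open(BIN, '<', 'logo.gif') or die "open: $!";
    binmode(BIN);                        # binary discipline: no translation
    my $data = do { local $/; <BIN> };   # slurp the whole file
    close(BIN);

    # "Textual" things done to binary data:
    print "looks like a GIF\n" if $data =~ /^GIF8[79]a/;   # pattern scan
    print unpack("H*", substr($data, 0, 16)), "\n";        # hex dump of header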


Maybe these are acceptable, but I consider them rather unfriendly.  I,
personally, would rather see run-time auto-conversion, with Perl able
to handle up to UTF16 out of the gate, and hooks left in to expand as
needed.  (From that perspective, I "strongly" x 5 agree with a level
of abstraction in the core that hides the details of any such
implementation from the casual coder - except in the actual conversion
logic, of course.)
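
For what it's worth, here's a toy model of the flag-check-and-promote
idea, written as user-level Perl purely for illustration; it is not a
proposal for the actual internals.  Each string carries an encoding
tag, and a binary op transcodes only when the tags differ:

    my %width = (ascii => 1, utf8 => 2, utf16 => 3);   # relative widths

    sub promote {                      # stand-in for a real transcoding step
        my ($s, $enc) = @_;            # $s is [ $text, $tag ]
        return $s->[1] eq $enc ? $s : [ $s->[0], $enc ];
    }

    sub concat {
        my ($a, $b) = @_;
        # Flag check: pick the wider of the two tags...
        my $enc = $width{$a->[1]} >= $width{$b->[1]} ? $a->[1] : $b->[1];
        # ...and promote only when needed, only now.
        ($a, $b) = (promote($a, $enc), promote($b, $enc));
        return [ $a->[0] . $b->[0], $enc ];
    }

    my $r = concat([ "foo", 'ascii' ], [ "bar", 'utf16' ]);
    print "result tagged '$r->[1]'\n";   # promoted to utf16 on demand

Two pure-ASCII operands never pay for a conversion at all, which is
the whole point.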

 Then again, this may open too many cans of {your proverbial Pandora's
Box}, and the single encoding is, technically, the best method.  Or I
could be just plain wrong.

  -- 
Bryan C. Warnock
([EMAIL PROTECTED])
