Perhaps string promotion a la SV promotion?

You could have each string carry around an extra byte or two of
overhead, and encode ASCII vs UTF8 (vs UTF32 vs FOO).

Line disciplines would set the appropriate flag, and any string
handling function could read the flag if it needed to differentiate
by type, or even convert the string to a different type.
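
To make that concrete, here is a rough sketch in C of what a flagged string and one flag-aware function might look like. Every name in it (`tstr`, `str_enc`, `tstr_chars`) is made up for illustration; none of it is taken from the perl5 source, and a real design would need more encodings and error handling.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical encoding tags; a single byte of overhead covers these. */
typedef enum { ENC_ASCII, ENC_UTF8, ENC_UTF32 } str_enc;

/* Hypothetical flagged string: raw bytes plus a tag saying how to read them. */
typedef struct {
    str_enc        enc;     /* set by whoever created the string */
    size_t         nbytes;  /* length of data in bytes */
    const uint8_t *data;
} tstr;

/* Count characters. Only the UTF8 case has to inspect the bytes. */
size_t tstr_chars(const tstr *s)
{
    size_t i, n = 0;

    assert(s != NULL);
    switch (s->enc) {
    case ENC_ASCII:
        return s->nbytes;          /* one byte per character */
    case ENC_UTF32:
        return s->nbytes / 4;      /* four bytes per character */
    case ENC_UTF8:
        for (i = 0; i < s->nbytes; i++)
            if ((s->data[i] & 0xC0) != 0x80)  /* skip continuation bytes */
                n++;
        return n;
    }
    return 0;
}
```

A function that does not care about the encoding (a plain byte copy, say) would simply never look at `enc`.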

That would save the overhead of converting everything to a different
type when it isn't necessary, and it should let you extend the set of
string types to anything you can squeeze a flag in for, provided you
write the appropriate conversion routines.
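
The promotion step could look something like the C sketch below: a string stays in its narrow form until some caller explicitly asks for the wide one. The names (`pstr`, `pstr_enc`, `pstr_from_ascii`, `pstr_promote_utf32`, `pstr_free`) are all hypothetical, only the ASCII to UTF32 direction is shown, and the wide form uses host byte order.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical tags; only two encodings to keep the sketch short. */
typedef enum { P_ASCII, P_UTF32 } pstr_enc;

typedef struct {
    pstr_enc enc;
    size_t   nbytes;  /* length of data in bytes */
    void    *data;    /* owned by the string */
} pstr;

/* Build an ASCII-flagged string from a C string. Returns 0 on success. */
int pstr_from_ascii(pstr *s, const char *src)
{
    size_t n;

    assert(s != NULL && src != NULL);
    n = strlen(src);
    s->data = malloc(n ? n : 1);
    if (s->data == NULL)
        return -1;
    memcpy(s->data, src, n);
    s->enc = P_ASCII;
    s->nbytes = n;
    return 0;
}

/* Promote in place to UTF32. A no-op if the string is already wide. */
int pstr_promote_utf32(pstr *s)
{
    const unsigned char *narrow;
    uint32_t *wide;
    size_t i, n;

    assert(s != NULL);
    if (s->enc == P_UTF32)
        return 0;

    n = s->nbytes;
    narrow = s->data;
    wide = malloc((n ? n : 1) * sizeof *wide);
    if (wide == NULL)
        return -1;
    for (i = 0; i < n; i++)
        wide[i] = narrow[i];       /* ASCII code points map one to one */

    free(s->data);
    s->data = wide;
    s->enc = P_UTF32;
    s->nbytes = n * sizeof *wide;  /* four times the storage */
    return 0;
}

void pstr_free(pstr *s)
{
    free(s->data);
    s->data = NULL;
    s->nbytes = 0;
}
```

Only strings that actually reach `pstr_promote_utf32` pay the fourfold storage cost; everything else stays at one byte per character.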

I was going to write an RFC suggesting this, but it is a little
outside of my realm, so someone else is welcome to pick it up, or
shoot it down.


On Sat, 05 Aug 2000, Nick Ing-Simmons wrote:
> Benjamin Stuhl <[EMAIL PROTECTED]> writes:
> >> No, that's the beauty of utf8: the C datatype is still char* and as
> >> long as you stick to 7-bits ASCII you won't know the difference.
> >> wchar_t comes from a totally different school of thought, where all
> >> your strings are instantly incompatible and take twice or four times
> >> the memory.
> >> 
> >> Larry knew what he was doing when he decided on utf8.
> >
> >It has also led to the perl5 internals being, to put it
> >bluntly, a horrible mess. 
> 
> Agreed - but that is due to grafting it in late - and possibly 
> trying to be too clever intuiting whether existing perl5-code is 
> working on bytes or chars.
> 
> But the goal was to avoid a 100Mbyte ASCII "string" becoming a 400Mbyte
> UTF32 "string" with 300Mbytes of 0x000000.
> 
> >And forget about the regex
> >engine.
> 
> We cannot do that ;-) 
> Perhaps the regex engine should always force UTF8 form?
> 
> 
> >Perhaps if it was designed in from the beginning things
> >would be better, 
> 
> That is _our_ job - to make it better.
> 
> >but this is something that needs serious 
> >discussion.
> 
> Consider it started ...
> 
> -- 
> Nick Ing-Simmons
-- 
Bryan C. Warnock
([EMAIL PROTECTED])
