On Fri, Dec 15, 2000 at 03:10:16PM -0500, Dan Sugalski wrote:
> At 11:18 AM 12/15/00 -0600, Jarkko Hietaniemi wrote:
> >On Fri, Dec 15, 2000 at 12:13:01PM +0000, Simon Cozens wrote:
> > > IMHO, the first thing we need to design and code is the API and runtime
> > > library, since everything else builds on top of that, and we can design
> > > other stuff in parallel with coding it. (A lot of it will be grunt work.)
> > >
> > > So, before we start even thinking about what we need, it's time to look
> > > at the vexed question of string representation. How do we do Unicode
> > > without getting into the horrendous non-Latin1 cockups we're seeing on
> > > p5p right now? Larry
> >
> >As painful as it may sound (codingwise) I would urge us to spare some
> >thought for using (internally) UTF-32 for those encodings for which
> >UTF-8 would be *longer* than the UTF-32 (mainly the Asian scripts).
>
> If we can manage it, I'd prefer to not have a preferred internal
I didn't mean 'preferred'; I meant that if UTF-8 would be longer for
some encodings, then using straight honest UTF-32 would make much more
sense for both space *and* speed.
> representation and Do The Right Thing in a general way. (Though I know that
> we may have to go more specific for speed)
>
> I can see us having good reason to handle at least:
>
> Binary
> UTF-8 (and yes, I know latin-1, or ASCII, or something of the sort is a
> proper subset of UTF-8)
> EBCDIC
> UTF-32
> Shift-JIS
>
> as text. How to generalize the regex engine (which strikes me as the most
> likely piece of perl to care deeply about representation) to handle all the
> types is an interesting question. I'm currently trying to figure out a way
> to generalize things, and it's mostly there, but I'm really worried about
> speed issues because of it.
>
> Worst case, handling bytes and UTF-32 should get us by (variable-length
> encodings are a *pain*...), though we'd be well-served to handle more natively.
EMPHATIC YES (after glaring for weeks at the regex/utf8 code).
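
To make the pain concrete, here is a minimal sketch of the difference
when you want the byte offset of the n-th character. It is Python
purely for illustration, not anything from the perl source, and the
function names are made up. With a variable-length encoding you have to
scan from the start; with a fixed-width one it is a multiplication.

```python
def utf8_char_offset(buf: bytes, n: int) -> int:
    """Byte offset of the n-th character (0-based) in UTF-8 data.

    Hypothetical helper: it has to walk the buffer, because each
    character occupies a variable number of bytes.
    """
    seen = 0
    for i, b in enumerate(buf):
        # Continuation bytes look like 10xxxxxx; anything else starts a character.
        if (b & 0xC0) != 0x80:
            if seen == n:
                return i
            seen += 1
    raise IndexError("character index out of range")


def utf32_char_offset(buf: bytes, n: int) -> int:
    """Byte offset of the n-th character (0-based) in UTF-32 data.

    Hypothetical helper: every character is exactly 4 bytes, so no scan.
    """
    if not 0 <= n < len(buf) // 4:
        raise IndexError("character index out of range")
    return 4 * n


text = "naïve 日本語"
u8 = text.encode("utf-8")
u32 = text.encode("utf-32-be")  # big-endian, no byte order mark

off8 = utf8_char_offset(u8, 7)     # found by scanning
off32 = utf32_char_offset(u32, 7)  # found by arithmetic
```

The regex engine does this sort of stepping constantly (forwards and
backwards), which is why the fixed-width case is so much easier to
live with.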
>
> Dan
>
> --------------------------------------"it's like this"-------------------
> Dan Sugalski even samurai
> [EMAIL PROTECTED] have teddy bears and even
> teddy bears get drunk
--
$jhi++; # http://www.iki.fi/jhi/
# There is this special biologist word we use for 'stable'.
# It is 'dead'. -- Jack Cohen