On Sat, Jan 05, 2008 at 11:09:57AM +0000, Nicholas Clark wrote:
> On Sat, Jan 05, 2008 at 02:11:35AM -0800, chromatic wrote:
> > On Saturday 05 January 2008 01:26:48 Patrick R. Michaud wrote:
> 
> > > I think it will still be worthwhile to investigate
> > > converting strings into a fixed-width encoding of some sort
> > > instead of performing scans on variable-width encodings.
> > 
> > Agreed... if we figure out our Unicode strategy.
> 
> Jarkko's view was that if he were doing Perl 5 Unicode again he would opt for
> fixed width 32 bit rather than UTF-8, because a lot of algorithms,
> particularly in regexps, assume linear random access.

Based on what I now realize about working with utf8 strings
(and how that affects PGE), I would wholeheartedly agree.

> Space wise, a better compromise, at only slightly more complexity
> (vtables for accessors feel natural for this) is to go for fixed width,
> smallest that will hold the largest Unicode code point in the string,
> 7 bit, 8 bit, 16 bit and 32 bit.

I think we could probably omit the 7 bit version.  Tracking the
largest Unicode code point in a string can be awkward --
promoting a string to a larger representation is no problem, but
figuring out when it's safe to demote a string to a smaller
representation may require rescanning the whole string.
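
To illustrate the promote/demote asymmetry, here is a minimal
sketch in Python.  It is not Parrot code; the names min_width and
FixedWidthString are hypothetical, and it assumes only the 8, 16
and 32 bit widths discussed above.

```python
def min_width(code_points):
    """Smallest fixed width in bytes (1, 2 or 4) that holds every code point."""
    highest = max(code_points, default=0)
    if highest <= 0xFF:
        return 1
    if highest <= 0xFFFF:
        return 2
    return 4


class FixedWidthString:
    """Hypothetical string that tracks the width its contents need."""

    def __init__(self, text=""):
        self.code_points = [ord(ch) for ch in text]
        self.width = min_width(self.code_points)

    def set(self, index, code_point):
        self.code_points[index] = code_point
        # Promotion is cheap: only the new code point has to be checked.
        needed = min_width([code_point])
        if needed > self.width:
            self.width = needed

    def demote(self):
        # Demotion is the costly direction: every code point must be
        # rescanned before a narrower width is known to be safe.
        self.width = min_width(self.code_points)
        return self.width
```

Overwriting the only wide character leaves the width unchanged
until demote() is called, which is why demotion probably has to be
an explicit (or lazy) operation rather than something done on
every write.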

The other tricky part to this may be that even though we may use a
fixed-width encoding internally, input and output will still often
want to use or assume utf8 encoding.  So, we'd need to decide where
the translations belong -- in Parrot, in the tools, or in the HLL.
(Right now I suspect the tools or HLL will have to be responsible
for these choices, leaving Parrot to "pure" internal representations
of whatever encoding is being advertised.)
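
As a rough sketch of translating at the boundary, assuming the
translation lives outside the core string type: the helpers
from_utf8 and to_utf8 below are hypothetical names, written in
Python rather than against any Parrot API.

```python
def from_utf8(data):
    """Decode UTF-8 bytes into a list of code points (fixed-width internally)."""
    return [ord(ch) for ch in data.decode("utf-8")]


def to_utf8(code_points):
    """Encode a list of code points back into UTF-8 bytes for output."""
    return "".join(chr(cp) for cp in code_points).encode("utf-8")


raw = "na\u00efve \u20ac".encode("utf-8")   # variable-width bytes on input
chars = from_utf8(raw)

# Indexing is now by character position, with no scan over the bytes.
last = chars[6]

# Output goes back through the encoder at the boundary.
round_trip = to_utf8(chars)
```

Here raw is 10 bytes long while chars has 7 elements, so byte
offsets and character offsets only coincide once the data has been
decoded.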

Thanks for the information; it's a big help in guiding us forward.

Pm
