On Sat, Jan 05, 2008 at 11:09:57AM +0000, Nicholas Clark wrote:
> On Sat, Jan 05, 2008 at 02:11:35AM -0800, chromatic wrote:
> > On Saturday 05 January 2008 01:26:48 Patrick R. Michaud wrote:
>
> > > I think it will still be worthwhile to investigate
> > > converting strings into a fixed-width encoding of some sort
> > > instead of performing scans on variable-width encodings.
> >
> > Agreed... if we figure out our Unicode strategy.
>
> Jarkko's view was that if he were doing Perl 5 Unicode again he would opt for
> fixed width 32 bit rather than UTF-8, because a lot of algorithms,
> particularly in regexps, assume linear random access.

Based on what I now realize about working with UTF-8 strings (and how
that affects PGE), I would wholeheartedly agree.

> Space wise, a better compromise, at only slightly more complexity
> (vtables for accessors feel natural for this) is to go for fixed width,
> smallest that will hold the largest Unicode code point in the string,
> 7 bit, 8 bit, 16 bit and 32 bit.

I think we could probably omit the 7 bit version.

Tracking the largest Unicode code point in a string can be tricky:
promoting a string to a larger representation is no problem, but
figuring out when it's safe to demote a string to a smaller
representation may be harder.

The other tricky part may be that even though we may use a fixed-width
encoding internally, input and output will still often want to use
or assume UTF-8 encoding.  So, we'd need to decide where the
translations belong -- in Parrot, in the tools, or in the HLL.
(Right now I suspect the tools or HLL will have to be responsible
for these choices, leaving Parrot to "pure" internal representations
of whatever encoding is being advertised.)

Thanks for the information, it's a big help in guiding us forward.

Pm
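
For illustration only, here is a minimal Python sketch of the "smallest
fixed width that holds the largest code point" idea discussed above. It
is not Parrot code: the function names, the little-endian byte order,
and the reading of "8 bit" as code points up to 0xFF are all assumptions
made for the example. It shows why promotion is cheap to decide (the
target width is known up front) while demotion, at least in this naive
form, needs a full rescan of the string.

```python
def width_for(text):
    """Smallest fixed width in bytes (1, 2 or 4) that holds every code point."""
    largest = max(map(ord, text), default=0)
    if largest <= 0xFF:
        return 1
    if largest <= 0xFFFF:
        return 2
    return 4


def encode_fixed(text, width=0):
    """Encode text at a fixed width; width=0 means pick the narrowest one."""
    if width == 0:
        width = width_for(text)
    # to_bytes raises OverflowError if a code point does not fit in width.
    buf = b"".join(ord(ch).to_bytes(width, "little") for ch in text)
    return width, buf


def code_point_at(width, buf, index):
    """Constant-time access to the index-th code point."""
    start = index * width
    return int.from_bytes(buf[start:start + width], "little")


def demote(width, buf):
    """Re-encode at the narrowest width; this naive version rescans everything."""
    count = len(buf) // width
    text = "".join(chr(code_point_at(width, buf, i)) for i in range(count))
    return encode_fixed(text)
```

A string such as "h\u00e9\u20ac" would be stored at 2 bytes per code
point here, and indexing into it is a single multiplication rather than
a scan. Input and output would still need a separate UTF-8 translation
step, wherever that ends up living.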