On Friday 04 January 2008 22:29:40 Patrick R. Michaud wrote:

> Actually, the perl6 compiler and PCT are really agnostic about utf8 --
> they rely on Parrot to handle any transcoding issues.  They try
> to keep strings as ASCII whenever possible, and only use unicode:"..."
> when there's a character that can't be encoded in ascii.
>
> An odd(?) feature of Parrot is that if any of the operands
> to a string opcode has a utf8 encoding, then the result
> ends up being marked as utf8, whether it needs to be or not.
> I don't know how ucs2 affects this -- but if the tests aren't
> passing after your hack then I suspect that Parrot is unable
> to do certain operations (e.g., compare) on ucs2 strings.

That's probably the case.  We haven't explicitly supported ICU in a while.

> Indeed.  Until Parrot's non-ICU implementation becomes a bit
> more robust, and when we figure out what is causing the tests
> to fail, we could have HLLCompiler check for the presence
> of ICU and transcode the source to ucs2 prior to parsing.
> It would also be a good idea to get the 'escape' method
> of CodeString to somehow produce its strings using
> ucs2 instead of utf8 encoding (although imcc doesn't
> really support a good way to do that yet).

I just stuck the find_encoding and trans_encoding ops in the code as 
appropriate.

> > (Callgrind suggests that about 45% of the running time of the NQP part of
> > the build comes from utf8_set_position and utf8_skip_forward.)
>
> Even better might be to figure out why utf8_set_position
> and utf8_skip_forward are slow and see if those can be sped
> up somehow.

I think you answered this in your next message.  They're fairly naive.

> I just looked at this, and ouch.  Every call to get_codepoint()
> for utf8 strings scans from the beginning of the string to
> locate the corresponding character position.  Since get_codepoint()
> is repeatedly called for every find_cclass and find_not_cclass
> opcode using a utf8 string target, and since most strings tend
> to get "promoted" into utf8, this repeated scanning can end up
> being really slow.  (For example, the find_not_cclass opcode
> gets used for scanning whitespace.)
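
To make the cost concrete, here's a rough sketch (not Parrot's actual code; the names are made up) of what that per-call rescan looks like.  Because UTF-8 characters are variable-width, finding character N means decoding lead bytes from the start of the buffer, so calling this once per character is O(n^2) over the string:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: byte length of a UTF-8 sequence from its lead byte. */
static size_t utf8_char_len(unsigned char lead) {
    if (lead < 0x80)          return 1;  /* ASCII */
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    return 4;
}

/* Naive seek to a character index: restarts from the beginning of the
 * buffer on every call, like utf8_set_position does today.  A loop that
 * calls this for characters 0..n-1 does O(n^2) work in total. */
static unsigned char *utf8_seek_naive(unsigned char *buf, size_t char_index) {
    unsigned char *p = buf;
    while (char_index--)
        p += utf8_char_len(*p);
    return p;
}
```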
>
> Thus, I agree that using a fixed-width encoding on strings
> could be a big improvement for anything using PGE, because then
> these opcodes would avoid the repeated scans from the start
> of the string.  I also think this means we need a way in PIR to
> easily write unicode string literals using fixed-width encodings.
>
> Also, we ought to be able to speed up find_not_cclass and
> find_cclass by using string iterators instead of repeated
> calls to get_codepoint.  This could reduce the repeated
> scans from the beginning of the string.
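
Something like the following sketch of an iterator-style scan (illustrative only, not Parrot's API): keep a running byte offset and decode forward in a single pass, instead of re-seeking from byte zero for every character:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: byte length of a UTF-8 sequence from its lead byte. */
static size_t utf8_char_len(unsigned char lead) {
    if (lead < 0x80)          return 1;
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    return 4;
}

/* Iterator-style find_not_cclass for a whitespace class: one forward
 * pass, advancing by the decoded character length, returning the
 * character index of the first non-whitespace character (or the total
 * character count if none is found). */
static size_t find_not_whitespace_iter(const unsigned char *buf, size_t len) {
    size_t byte = 0, chars = 0;
    while (byte < len) {
        unsigned char c = buf[byte];
        if (c != ' ' && c != '\t' && c != '\n')
            return chars;                  /* first char not in the class */
        byte += utf8_char_len(c);          /* advance without re-seeking */
        chars++;
    }
    return chars;
}
```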
>
> Lastly, string iterators on utf8 encoded strings do some
> basic memoizing of the last known character offset and
> location, but the utf8_set_position function doesn't make
> use of this information -- it always restarts from the
> beginning.  (There's even an XXX note in src/encodings/utf8.c
> that says it should use the quickest direction instead
> of scanning from the start.)
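
The fix that XXX note asks for might look something like this sketch (again illustrative, not the Parrot implementation): reuse the memoized position whenever the target is at or past it, and only fall back to a full rescan when seeking backwards:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: byte length of a UTF-8 sequence from its lead byte. */
static size_t utf8_char_len(unsigned char lead) {
    if (lead < 0x80)          return 1;
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    return 4;
}

typedef struct {
    const unsigned char *buf;
    size_t char_index;    /* memoized character position */
    size_t byte_offset;   /* memoized byte position      */
} Utf8Iter;

/* set_position that starts from the last known (char_index, byte_offset)
 * pair when the target lies ahead of it, instead of always rescanning
 * from byte 0. */
static void utf8_set_position_memo(Utf8Iter *it, size_t target) {
    if (target < it->char_index) {   /* behind the memo: restart */
        it->char_index  = 0;
        it->byte_offset = 0;
    }
    while (it->char_index < target) { /* walk forward from the memo */
        it->byte_offset += utf8_char_len(it->buf[it->byte_offset]);
        it->char_index++;
    }
}
```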
>
> If nobody else is likely to look into improving these
> sections of the code, I suspect I should probably go
> ahead and spend the time to do it.
>
> c -- what are the individual runtime percentages for
> utf8_set_position and utf8_skip_forward?

utf8_set_position takes up about 35% of the runtime of the NQP run on parser 
actions and utf8_skip_forward takes up about 11%.  Getting rid of those 
nearly halves the time it takes.

(When I say time, I mean "number of cycles".)

-- c
