On Friday 04 January 2008 22:29:40 Patrick R. Michaud wrote:

> Actually, the perl6 compiler and PCT are really agnostic about utf8 --
> they rely on Parrot to handle any transcoding issues. They try
> to keep strings as ASCII whenever possible, and only use unicode:"..."
> when there's a character that can't be encoded in ascii.
>
> An odd(?) feature of Parrot is that if any of the operands
> to a string opcode has a utf8 encoding, then the result
> ends up being marked as utf8, whether it needs to be or not.
> I don't know how ucs2 affects this -- but if the tests aren't
> passing after your hack then I suspect that Parrot is unable
> to do certain operations (e.g., compare) on ucs2 strings.
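(To make that promotion concrete, a minimal, untested PIR sketch --
assuming unicode: literals arrive utf8-encoded, and using the
encoding/encodingname introspection ops that sit alongside
find_encoding/trans_encoding in string.ops:)

    .sub 'promotion_demo' :main
        $S0 = "foo"              # plain ascii literal
        $S1 = unicode:"bar"      # unicode: literal, marked utf8
        $S2 = concat $S0, $S1    # one utf8 operand taints the result
        $I0 = encoding $S2       # fetch the result's encoding number
        $S3 = encodingname $I0
        say $S3                  # "utf8", even though the data is pure ascii
    .end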
That's probably the case. We haven't explicitly supported ICU in a while.

> Indeed. Until Parrot's non-ICU implementation becomes a bit
> more robust, and when we figure out what is causing the tests
> to fail, we could have HLLCompiler check for the presence
> of ICU and transcode the source to ucs2 prior to parsing.
> It would also be a good idea to get the 'escape' method
> of CodeString to somehow produce its strings using
> ucs2 instead of utf8 encoding (although imcc doesn't
> really support a good way to do that yet).

I just stuck the find_encoding and trans_encoding ops in the code as
appropriate (rough sketch in the P.S. below).

> > (Callgrind suggests that about 45% of the running time of the NQP part of
> > the build comes from utf8_set_position and utf8_skip_forward.)
>
> Even better might be to figure out why utf8_set_position
> and utf8_skip_forward are slow and see if those can be sped
> up somehow.

I think you answered this in your next message. They're fairly naive.

> I just looked at this, and ouch. Every call to get_codepoint()
> for utf8 strings scans from the beginning of the string to
> locate the corresponding character position. Since get_codepoint()
> is repeatedly called for every find_cclass and find_not_cclass
> opcode using a utf8 string target, and since most strings tend
> to get "promoted" into utf8, this repeated scanning can end up
> being really slow. (For example, the find_not_cclass opcode
> gets used for scanning whitespace.)
>
> Thus, I agree that using a fixed-width encoding on strings
> could be a big improvement for anything using PGE, because then
> these opcodes would avoid the repeated scans from the start
> of the string. I also think this means we need a way in PIR to
> easily create unicode string literals using fixed-width encodings.
>
> Also, we ought to be able to speed up find_not_cclass and
> find_cclass by using string iterators instead of repeated
> calls to get_codepoint. This could reduce the repeated
> scans from the beginning of the string.
>
> Lastly, string iterators on utf8-encoded strings do some
> basic memoizing of the last known character offset and
> location, but the utf8_set_position function doesn't make
> use of this information -- it always restarts from the
> beginning. (There's even an XXX note in src/encodings/utf8.c
> that says it should use the quickest direction instead
> of scanning from the start.)
>
> If nobody else is likely to look into improving these
> sections of the code, I suspect I should probably go
> ahead and spend the time to do it.
>
> c -- what are the individual runtime percentages for
> utf8_set_position and utf8_skip_forward?

utf8_set_position takes up about 35% of the runtime of the NQP run on
parser actions, and utf8_skip_forward takes up about 11%. Getting rid of
those nearly halves the time it takes. (When I say time, I mean "number
of cycles".)

-- c
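P.S. The find_encoding/trans_encoding hack amounts to a few lines of PIR.
A rough sketch -- the sub name and the place it hooks in are stand-ins,
not the actual code:

    .sub 'transcode_source'
        .param string source
        $I0 = find_encoding 'ucs2'           # look up the ucs2 encoding
        source = trans_encoding source, $I0  # re-encode to fixed width
        .return (source)
    .end

Transcoding once up front means find_cclass/find_not_cclass then walk a
fixed-width string, so get_codepoint no longer rescans from the start of
the string on every call. (Same caveat as above: without ICU, not every
op works on ucs2 strings yet.)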