Since I don't see anything to save/restore the intstack on subroutine calls, I am wondering what happens if a regex has a (?{ CODE }), and that CODE calls a regex. Are we guaranteed that after a regex completes (whether it succeeds or fails), the intstack is in the same state it started in? If not (and remember, exceptions can leave things in odd states), what keeps things from being fubared?
Throughout the rx opcode definitions, I see much use of string_index to find the character at a particular index. If the string's encoding uses multiple bytes per character, this can be O(N) for each call. Not good.

Since str->encoding->skip_forward is supposed to be O(N) in terms of how many chars are skipped forwards, and since most of those string_index()s are one character-index away from an index which was recently accessed, we should be able to switch to that for a great speed improvement.

Alas, this only works within a single op, since there's a chance that intervening ops will cause memory to be allocated, potentially causing a GC, potentially relocating the string's buffer, and thus invalidating the void* pointer. (Well, unless we use string_pin to keep the string from getting moved.) Is there any chance of us ever getting a real string iterator type, for which this isn't so much of a concern?

Actually, if we did our string->skip_forward type stuff, but only saved indices relative to the start of the *buffer*, maybe it would be ok... However, we'd still need an additional int register, so that we can track the byte position as well as the character position.

Given the growing number of things that each regex subroutine needs to keep track of (string being searched, start of search as a character offset, start as a byte offset, current search point as a character offset, current search point as a byte offset), IMHO, this would be a good use for a regex state struct.
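To make the buffer-relative idea a bit more concrete, here is a small standalone C sketch. It is not Parrot code: Buf, Cursor, relocate() and the other helper names are made up for illustration, and it assumes the buffer holds valid UTF-8. It only tries to show that a cursor which stores offsets (rather than a pointer) can step forward one character at a time and still be usable after the buffer has been moved, which is roughly what a GC relocation would do.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a movable string buffer holding UTF-8. */
typedef struct {
    char  *start;   /* may change whenever the buffer is relocated */
    size_t bytes;   /* buffer length in bytes */
} Buf;

/* Hypothetical cursor: both offsets are relative to the buffer start. */
typedef struct {
    size_t char_off;
    size_t byte_off;
} Cursor;

/* Byte length of the UTF-8 sequence with lead byte c (assumes valid input). */
static size_t utf8_len(unsigned char c) {
    if (c < 0x80) return 1;
    if ((c & 0xE0) == 0xC0) return 2;
    if ((c & 0xF0) == 0xE0) return 3;
    return 4;
}

/* O(n): walk from the start to find the byte offset of character n. */
static size_t index_from_start(const Buf *b, size_t n) {
    size_t off = 0;
    while (n-- > 0 && off < b->bytes)
        off += utf8_len((unsigned char)b->start[off]);
    return off;
}

/* O(1): step the cursor forward one character; returns 0 at end of buffer. */
static int cursor_advance(const Buf *b, Cursor *c) {
    if (c->byte_off >= b->bytes) return 0;
    c->byte_off += utf8_len((unsigned char)b->start[c->byte_off]);
    c->char_off++;
    return 1;
}

/* Simulate a collector moving the buffer to a new address. */
static int relocate(Buf *b) {
    char *moved = malloc(b->bytes);
    if (moved == NULL) return 0;
    memcpy(moved, b->start, b->bytes);
    free(b->start);
    b->start = moved;
    return 1;
}

/* Returns 1 if the cursor is still correct after the buffer has moved. */
static int demo_offsets_survive_relocation(void) {
    const char text[] = "h\xC3\xA9llo";   /* 5 characters, 6 bytes */
    Buf b;
    Cursor c = {0, 0};
    int ok;

    b.bytes = sizeof text - 1;
    b.start = malloc(b.bytes);
    if (b.start == NULL) return 0;
    memcpy(b.start, text, b.bytes);

    cursor_advance(&b, &c);               /* past 'h' */
    cursor_advance(&b, &c);               /* past the two-byte character */
    ok = relocate(&b);                    /* a raw pointer would now dangle */
    cursor_advance(&b, &c);               /* past the first 'l' */

    ok = ok && c.char_off == 3 && c.byte_off == 4
            && c.byte_off == index_from_start(&b, 3)
            && b.start[c.byte_off] == 'l';
    free(b.start);
    return ok;
}
```

The cost is the one mentioned above: two integers per position instead of one, and a pointer has to be recomputed from the buffer start on every access.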
With a state struct like that, the ops could look something like this:

    op rx_advance(in pmc, inconst int) {
        RXState * state = (RXState*)PMC_data($1);
        STRING  * str   = state->str;
        if( state->char_off >= str->strlen ) {
            goto OFFSET($2);
        }
        ++state->char_off;
        /* byte offsets are kept relative to the start of the buffer */
        state->byte_off = (char*)str->encoding->skip_forward(
            (void*)((char*)str->strstart + state->byte_off), 1
        ) - (char*)str->strstart;
        goto NEXT();
    }

    op rx_literal(in pmc, in str, inconst int) {
        RXState * state = (RXState*)PMC_data($1);
        STRING  * str   = state->str;
        void * a = (void*)( (char*)str->strstart + state->byte_off );
        void * b = $2->strstart;
        int num_chars = 0;
        /* don't read past the end of the string being searched */
        if( state->char_off + $2->strlen > str->strlen )
            goto OFFSET($3);
        while( num_chars < $2->strlen ) {
            INTVAL a_code = str->encoding->decode(a);
            INTVAL b_code = $2 ->encoding->decode(b);
            /* XXX should do chartypes, too */
            if( a_code != b_code )
                goto OFFSET($3);
            a = str->encoding->skip_forward(a, 1);
            b = $2 ->encoding->skip_forward(b, 1);
            ++num_chars;
        }
        state->byte_off  = (char*)a - (char*)str->strstart;
        state->char_off += num_chars;   /* advance by, not reset to, the literal's length */
        goto NEXT();
    }

    op rx_char(in pmc, in int, inconst int) {
        RXState * state = (RXState*)PMC_data($1);
        STRING  * str   = state->str;
        void * p;
        if( state->char_off >= str->strlen ) {
            goto OFFSET($3);
        }
        p = state->byte_off + (char*)str->strstart;
        if( str->encoding->decode(p) != $2 ) {
            goto OFFSET($3);
        }
        ++state->char_off;
        state->byte_off = (char*)str->encoding->skip_forward( p, 1 )
                          - (char*)str->strstart;
        goto NEXT();
    }

    ....

    op rx_search(in pmc, in str, inconst int) {
        RXState * state = (RXState*)PMC_data($1);
        STRING  * str   = state->str;
        INTVAL idx = string_str_index( str, $2, state->char_off );
        if( idx == -1 ) {
            goto OFFSET($3);
        }
        else {
            char * p = (char*)str->strstart + state->byte_off;
            /* skip from the current position to just past the match,
               then store the result relative to the buffer start */
            state->byte_off = (char*)str->encoding->skip_forward(
                p, idx - state->char_off + $2->strlen
            ) - (char*)str->strstart;
            state->char_off = idx + $2->strlen;
        }
        goto NEXT();
    }

Blech, this is really making me wish for some sort of "Real" string iterator type.
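For what it's worth, here is a rough standalone sketch in C of what such an iterator type might look like. Everything in it is hypothetical (the Enc vtable, StrIter, iter_next, iter_match_literal and the two toy encodings are invented for this example and are not an existing Parrot API), and it assumes well-formed input. The point is only that the iterator carries the encoding plus both offsets, so a literal match in the style of rx_literal can compare code points across two different encodings and commit the new position only on success.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical encoding vtable: decode one character, report its width. */
typedef struct {
    unsigned long (*decode)(const unsigned char *p);
    size_t        (*width)(const unsigned char *p);
} Enc;

static unsigned long latin1_decode(const unsigned char *p) { return *p; }
static size_t latin1_width(const unsigned char *p) { (void)p; return 1; }

static size_t utf8_width(const unsigned char *p) {
    if (*p < 0x80) return 1;
    if ((*p & 0xE0) == 0xC0) return 2;
    if ((*p & 0xF0) == 0xE0) return 3;
    return 4;
}

static unsigned long utf8_decode(const unsigned char *p) {
    size_t n = utf8_width(p), i;
    unsigned long cp;
    if (n == 1) return *p;
    cp = *p & (0xFF >> (n + 1));          /* payload bits of the lead byte */
    for (i = 1; i < n; i++)
        cp = (cp << 6) | (p[i] & 0x3F);   /* six bits per continuation byte */
    return cp;
}

static const Enc LATIN1 = { latin1_decode, latin1_width };
static const Enc UTF8   = { utf8_decode,   utf8_width   };

/* Hypothetical iterator: offsets only, so a moved buffer just means
   updating buf, never the position itself. */
typedef struct {
    const unsigned char *buf;
    size_t     bytes;
    const Enc *enc;
    size_t     char_off;
    size_t     byte_off;
} StrIter;

static StrIter iter_new(const char *s, size_t bytes, const Enc *enc) {
    StrIter it;
    it.buf = (const unsigned char *)s;
    it.bytes = bytes;
    it.enc = enc;
    it.char_off = 0;
    it.byte_off = 0;
    return it;
}

static int iter_at_end(const StrIter *it) {
    return it->byte_off >= it->bytes;
}

/* Decode the current character and step past it. */
static unsigned long iter_next(StrIter *it) {
    const unsigned char *p = it->buf + it->byte_off;
    unsigned long cp = it->enc->decode(p);
    it->byte_off += it->enc->width(p);
    it->char_off++;
    return cp;
}

/* Match literal at the subject's position; advance only on success. */
static int iter_match_literal(StrIter *subject, StrIter literal) {
    StrIter probe = *subject;             /* work on a copy */
    while (!iter_at_end(&literal)) {
        if (iter_at_end(&probe)) return 0;
        if (iter_next(&probe) != iter_next(&literal)) return 0;
    }
    *subject = probe;
    return 1;
}

/* Returns 1 if a Latin-1 literal matches a UTF-8 subject as expected. */
static int demo_literal_match(void) {
    StrIter subj = iter_new("caf\xC3\xA9!", 6, &UTF8);
    StrIter lit  = iter_new("caf\xE9", 4, &LATIN1);
    int matched  = iter_match_literal(&subj, lit);
    int rejected = !iter_match_literal(&subj, iter_new("?", 1, &LATIN1));
    return matched && rejected
        && subj.char_off == 4 && subj.byte_off == 5;
}
```

Working on a copy of the iterator and assigning it back at the end keeps the failure path trivial: a failed match leaves the subject position untouched, with no explicit rollback.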