Since I don't see anything to save/restore the intstack on subroutine
calls, I am wondering what happens if a regex has a (?{ CODE }), and
that CODE calls a regex.  Are we guaranteed that after a regex completes
(whether it succeeds or fails) the intstack is in the same state it
started in?  If not (and remember, exceptions can leave things in odd
states), what keeps things from being fubared?

Throughout the rx opcode definitions, I see much use of string_index to
find the character at a particular index.  If the string's encoding uses
multiple bytes per character, this can be O(N) for each call.  Not good.

Since str->encoding->skip_forward is supposed to be O(N) in terms of how
many chars are skipped forwards, and since most of those string_index()s
are one character-index away from an index which was recently accessed,
we should be able to switch to that for a great speed improvement.
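
To make the cost difference concrete, here is a minimal sketch in plain C.
It assumes a NUL-terminated, well-formed UTF-8 buffer, and every name in it
(utf8_skip_forward, utf8_index_from_start, utf8_next_offset) is a
hypothetical stand-in, not the real string_index or
encoding->skip_forward:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for str->encoding->skip_forward, UTF-8 only.
 * Advances n characters from p.  Assumes well-formed, NUL-terminated
 * UTF-8 with at least n characters remaining. */
static const char *utf8_skip_forward(const char *p, size_t n)
{
    assert(p != NULL);
    while (n-- > 0) {
        ++p;                                        /* past the lead byte */
        while (((unsigned char)*p & 0xC0) == 0x80)  /* past continuation bytes */
            ++p;
    }
    return p;
}

/* Index-style lookup: always walks from the start of the buffer, so
 * finding the byte offset of character idx costs O(idx). */
static size_t utf8_index_from_start(const char *start, size_t idx)
{
    return (size_t)(utf8_skip_forward(start, idx) - start);
}

/* Incremental lookup: given the byte offset of the current character,
 * the next character's offset costs one short skip. */
static size_t utf8_next_offset(const char *start, size_t byte_off)
{
    return (size_t)(utf8_skip_forward(start + byte_off, 1) - start);
}
```

Both functions return the same offsets; the second just does not re-walk
the prefix every time, which is the whole saving.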

Alas, this only works within a single op, since there's a chance
that intervening ops will cause memory to be allocated, potentially
causing a GC, potentially relocating the string's buffer, and thus
invalidating the void* pointer.  (Well, unless we use string_pin to keep
the string from getting moved.)

Is there any chance of us ever getting a real string iterator type, for
which this isn't so much of a concern?

Actually, if we did our string->skip_forward type stuff, but only saved
indices relative to the start of the *buffer*, maybe it would be ok...
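
A small sketch of why buffer-relative offsets might be safe where raw
pointers are not.  The Buf type and relocate() are hypothetical; relocate()
only imitates what a compacting GC could do to a string's buffer, it is not
Parrot's collector:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, stripped-down string header: just a buffer and its size. */
typedef struct {
    char  *strstart;  /* current location of the buffer */
    size_t bufused;   /* bytes in use */
} Buf;

/* Imitate a GC move: copy the bytes elsewhere and release the old block.
 * Any raw pointer into the old block is dangling after this call. */
static void relocate(Buf *b)
{
    char *moved = malloc(b->bufused);
    assert(moved != NULL);
    memcpy(moved, b->strstart, b->bufused);
    free(b->strstart);
    b->strstart = moved;
}

/* An offset is re-resolved against the current strstart on every use,
 * so it stays meaningful however often the buffer moves. */
static char byte_at(const Buf *b, size_t byte_off)
{
    assert(byte_off < b->bufused);
    return b->strstart[byte_off];
}
```

The price is one extra add per access, which seems cheap next to pinning
the string for the length of the match.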

However, we'd still need an additional int register, to hold the byte
position as well as the character position.  Given the growing number of things that each
regex subroutine needs to keep track of (string being searched, start of
search as a character offset, start as a byte offset, current search
point as a character offset, current search point as a byte offset),
IMHO, this would be a good use for a regex state struct.

   op rx_advance(in pmc, inconst int) {
      RXState * state = (RXState*)PMC_data($1);
      STRING * str = state->str;
      if( state->char_off >= str->strlen ) {
         goto OFFSET($2);
      }
      ++state->char_off;
      /* skip_forward returns void*, so cast before subtracting */
      state->byte_off = (char*)str->encoding->skip_forward(
         (void*)((char*)str->strstart + state->byte_off), 1 ) -
            (char*)str->strstart;
      goto NEXT();
   }

   op rx_literal(in pmc, in str, inconst int) {
      RXState * state = (RXState*)PMC_data($1);
      STRING * str = state->str;
      void * a = (void*)( (char*)str->strstart + state->byte_off );
      void * b = $2->strstart;
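      int num_chars = 0;
      /* fail early if fewer than $2->strlen chars remain to match */
      if( state->char_off + $2->strlen > str->strlen ) {
         goto OFFSET($3);
      }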
      while( num_chars < $2->strlen ) {
         INTVAL a_code = str->encoding->decode(a);
         INTVAL b_code = $2 ->encoding->decode(b);
         /* XXX should do chartypes, too */
         if( a_code != b_code ) goto OFFSET($3);
         a = str->encoding->skip_forward(a, 1);
         b = $2 ->encoding->skip_forward(b, 1);
         ++num_chars;
      }
      state->byte_off = (char*)a - (char*)str->strstart;
      state->char_off += num_chars;  /* advance, don't reset to the literal's length */
      goto NEXT();
   }

   op rx_char(in pmc, in int, inconst int) {
      RXState * state = (RXState*)PMC_data($1);
      STRING * str = state->str;
      void * p;
      if( state->char_off >= str->strlen ) {
         goto OFFSET($3);
      }
      p = state->byte_off + (char*)str->strstart;
      if( str->encoding->decode(p) != $2 ) {
         goto OFFSET($3);
      }
      ++state->char_off;
      state->byte_off = (char*)str->encoding->skip_forward( p, 1 )
         - (char*)str->strstart;
      goto NEXT();
   }

   ....

   op rx_search(in pmc, in str, inconst int) {
      RXState * state = (RXState*)PMC_data($1);
      STRING * str = state->str;
      INTVAL idx = string_str_index( str, $2, state->char_off );
      if( idx == -1 ) {
         goto OFFSET($3);
      } else {
         char * p = (char*)str->strstart + state->byte_off;
         /* byte_off is measured from strstart, not from p */
         state->byte_off = (char*)str->encoding->skip_forward( p,
            idx - state->char_off + $2->strlen ) - (char*)str->strstart;
         state->char_off = idx + $2->strlen;
      }
      goto NEXT();
   }
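
The ops above lean on an RXState whose layout I never spelled out.  One
possible layout, as a standalone sketch: SketchString is a hypothetical
cut-down stand-in for STRING, the start_* field names are my guess at what
the "start of search" bookkeeping would be called, and the advance helper
hard-codes NUL-terminated UTF-8 instead of going through str->encoding:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for STRING: only the fields the ops touch. */
typedef struct {
    const char *strstart;  /* start of the buffer (NUL-terminated here) */
    size_t      strlen;    /* length in characters, not bytes */
} SketchString;

/* One possible regex state struct: each position is kept both as a
 * character offset and as a byte offset from strstart. */
typedef struct {
    SketchString *str;             /* string being searched */
    size_t        start_char_off;  /* start of search, in characters */
    size_t        start_byte_off;  /* start of search, in bytes */
    size_t        char_off;        /* current search point, in characters */
    size_t        byte_off;        /* current search point, in bytes */
} RXState;

static void rx_state_init(RXState *st, SketchString *str)
{
    assert(st != NULL && str != NULL);
    st->str = str;
    st->start_char_off = st->start_byte_off = 0;
    st->char_off = st->byte_off = 0;
}

/* Same shape as rx_advance: returns 1 after stepping one character,
 * or 0 at end of string (where the op would branch to its offset). */
static int rx_state_advance(RXState *st)
{
    const char *p;
    assert(st != NULL && st->str != NULL);
    if (st->char_off >= st->str->strlen)
        return 0;
    p = st->str->strstart + st->byte_off + 1;   /* past the lead byte */
    while (((unsigned char)*p & 0xC0) == 0x80)  /* past continuation bytes */
        ++p;
    st->byte_off = (size_t)(p - st->str->strstart);
    ++st->char_off;
    return 1;
}
```

Keeping both offsets in one struct means each op updates them together,
instead of burning a pair of int registers per position.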

Blech, this is really making me wish for some sort of "Real" string
iterator type.

-- 
$a=24;split//,240513;s/\B/ => /for@@=qw(ac ab bc ba cb ca
);{push(@b,$a),($a-=6)^=1 for 2..$a/6x--$|;print "[EMAIL PROTECTED]
]\n";((6<=($a-=6))?$a+=$_[$a%6]-$a%6:($a=pop @b))&&redo;}
