[perl #15797] [PATCH] Regex speedup

via RT Mon, 29 Jul 2002 14:26:47 -0700

# New Ticket Created by  Angel Faus 
# Please include the string:  [perl #15797]
# in the subject line of all future correspondence about this issue. 
# <URL: http://rt.perl.org/rt2/Ticket/Display.html?id=15797 >



Hi,

I've made a patch for the regex engine, designed with the single goal 
of seriously cheating for speed. :-)

These are the most important changes:

* There is no regex state PMC, so there is no cost of creating and 
destroying it. (All the state is stored in registers, or in a 
per-interpreter intstack.) [This was suggested by Dan.]

* Some regex ops have been combined in order to increase code density 
and/or save some repeated computations. In some cases, this even 
makes it possible to avoid the stack in places that previously used 
it.

* Some minor tweaks in string.c, in order to inline the computation of 
string_index, when the string is a native one.
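
To make the three changes above concrete, here is a minimal standalone 
sketch in C. It is not code from the patch: the Str type and the 
functions str_index, match_literal and match_literal_all are 
hypothetical stand-ins, assumed only for illustration. The match 
position lives in a caller-owned variable (playing the role of a 
register) instead of an allocated state structure, a combined op 
repeats a literal without touching a stack, and character indexing 
takes a direct-access fast path for single-byte strings.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a Parrot STRING; only what the sketch needs. */
typedef struct {
    const unsigned char *buf;   /* character data */
    size_t len;                 /* length in characters */
    int single_byte;            /* nonzero: one byte per character */
    /* decoder for other encodings (assumed; unused when single_byte is set) */
    long (*decode)(const unsigned char *buf, size_t idx);
} Str;

/* Fetch character idx. Single-byte strings are a plain array access,
 * so the two indirect calls through the encoding are skipped. */
static inline long str_index(const Str *s, size_t idx)
{
    if (s->single_byte) {
        return s->buf[idx];
    }
    assert(s->decode != NULL);
    return s->decode(s->buf, idx);
}

/* Match lit at position *idx. On success advance *idx past the literal
 * and return 1; on failure leave *idx unchanged and return 0. */
static int match_literal(const Str *s, size_t *idx, const Str *lit)
{
    size_t i;

    if (*idx > s->len || lit->len > s->len - *idx) {
        return 0;               /* not enough characters left */
    }
    for (i = 0; i < lit->len; i++) {
        if (str_index(s, *idx + i) != str_index(lit, i)) {
            return 0;
        }
    }
    *idx += lit->len;
    return 1;
}

/* Combined op: consume as many consecutive copies of lit as possible.
 * No backtracking stack is needed for this loop. */
static void match_literal_all(const Str *s, size_t *idx, const Str *lit)
{
    if (lit->len == 0) {
        return;                 /* an empty literal would loop forever */
    }
    while (match_literal(s, idx, lit)) {
        /* keep going */
    }
}
```

With the subject "ababcd" and index 0, match_literal_all with "ab" 
leaves the index at 4, and a following match_literal with "cd" moves 
it to 6. Nothing is allocated or freed along the way.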

Some benchmarks (200,000 iterations, with the loop running inside parrot/perl):

parrot          perl            regex
----------------------------------------------------------
0.2012          0.2887          /^zza/
0.5089          0.6358          /ab*cd[cd]*/
0.6557          1.5460          /a(?:b|ac)*[cd]*/
1.7316          3.2161          /a(b|ac)*[cd]*/   (capturing)

This benchmark assumes a smart compiler on the parrot side, capable of 
detecting optimization opportunities. I started with these regexes 
and added optimizations to the regex engine until they were faster 
than their perl cousins. So this is not a complete speed comparison. 
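
As an illustration of the kind of opportunity meant here (a sketch 
under assumptions, not the compiler used for the numbers above): when a 
regex begins with a literal, the engine can jump straight to the next 
place where the literal's first character occurs instead of retrying 
the whole match at every position. The rx_search and rx_search_char ops 
in the patch serve a similar purpose. The function name search_literal 
and its signature are hypothetical, and the sketch assumes plain 
single-byte C buffers.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Find the leftmost occurrence of lit in s at or after start.
 * Returns the start offset of the match, or -1 if there is none. */
static long search_literal(const char *s, size_t s_len, size_t start,
                           const char *lit, size_t lit_len)
{
    assert(s != NULL && lit != NULL);

    if (lit_len == 0) {
        return start <= s_len ? (long)start : -1;
    }
    while (start <= s_len && lit_len <= s_len - start) {
        /* Only positions that leave room for the whole literal count. */
        size_t candidates = s_len - start - lit_len + 1;
        const char *p = memchr(s + start, lit[0], candidates);

        if (p == NULL) {
            return -1;          /* first character never occurs again */
        }
        start = (size_t)(p - s);
        if (memcmp(p, lit, lit_len) == 0) {
            return (long)start;
        }
        start++;                /* false candidate, resume after it */
    }
    return -1;
}
```

For the subject "xxzzay" and the literal "zza" this returns 2. A 
pattern that is anchored at the start, such as /^zza/, needs no search 
at all, only a single literal comparison at index 0.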

I think that the benchmarks are fair, but I could be wrong. The 
attached examples_regex.tar.gz includes all the files needed to 
reproduce them.

Anyway, this patch has brought me the personal conviction that parrot 
regexes can be as fast or faster than their perl equivalents, if we 
put a bit of effort into optimization.

The fast_re.patch file includes all the changes, and updates the tests 
to use the new opcodes. Documentation is not updated, though.

Best,

Angel Faus
[EMAIL PROTECTED]


-- attachment  1 ------------------------------------------------------
url: http://rt.perl.org/rt2/attach/31934/26589/cc89a5/fast_re.patch

-- attachment  2 ------------------------------------------------------
url: http://rt.perl.org/rt2/attach/31934/26590/10e423/examples_regex.tar.gz

Index: rx.c
===================================================================
RCS file: /cvs/public/parrot/rx.c,v
retrieving revision 1.16
diff -r1.16 rx.c
30,51d29
< rxinfo *
< rx_allocate_info(struct Parrot_Interp *interpreter, STRING *string)
< {
<     rxinfo *rx = mem_sys_allocate(sizeof(rxinfo));
< 
<     rx->minlength = rx->index = rx->startindex = 0;
<     rx->flags = enum_rxflags_none;
<     rx->whichway = enum_rxdirection_forwards;
< 
<     rx->string = string;
< 
<     rx->groupstart = pmc_new(interpreter, enum_class_PerlArray);
<     rx->groupend = pmc_new(interpreter, enum_class_PerlArray);
< 
<     rx->stack = intstack_new(interpreter);
< 
<     string_transcode(interpreter, rx->string, encoding_lookup("utf32"),
<                      rx->string->type, &rx->string);
< 
<     return rx;
< }
< 
190a169
> 
Index: rx.ops
===================================================================
RCS file: /cvs/public/parrot/rx.ops,v
retrieving revision 1.22
diff -r1.22 rx.ops
7c7
< #define RxAssertMore(rx, branchto) if((UINTVAL)rx->index >= string_length(rx->string)) { goto OFFSET(branchto); }
---
> #define RxAssertMore(str, idx, branchto) if( (UINTVAL)idx >= string_length(str) ) { goto OFFSET(branchto); }
171,283d170
< ###############################################################################
< 
< =head3 Preparation
< 
< =over 4
< 
< =cut
< 
< ########################################
< 
< =item C<rx_allocinfo>(out pmc, in str)
< 
< =item C<rx_allocinfo>(out pmc, in pmc)
< 
< Allocates a new info structure and puts it into the first parameter.  The second parameter
< is the string to match against.
< 
< =cut
< 
< op rx_allocinfo(out pmc, in str) {
< 	rxinfo *rx=rx_allocate_info(interpreter, $2);
< 
< 	$1=pmc_new(interpreter, enum_class_Pointer);
< 
< 	$1->data=(void*)rx;
< 
< 	goto NEXT();
< }
< 
< op rx_allocinfo(out pmc, in pmc) {
< 	rxinfo *rx=rx_allocate_info(interpreter, $2->vtable->get_string(interpreter, $2));
< 	
< 	$1=pmc_new(interpreter, enum_class_Pointer);
< 
< 	$1->data=(void*)rx;
< 	
< 	goto NEXT();
< }
< 
< ########################################
< 
< =item C<rx_clearinfo>(out pmc, in str)
< 
< Clears the data out of an existing info structure, preparing it for reuse.  The second 
< parameter is the new string to match against.  This is used to avoid memory reallocations;
< many programs will only need one info structure allocation over the program's lifetime.
< 
< =cut
< 
< op rx_clearinfo(inout pmc, in str) {
< 	RX_dUNPACK($1);
< 	rx->index=rx->startindex=0;
< 	rx->flags=enum_rxflags_none;
< 	rx->success=0;
< 	rx->minlength=0;
< 	rx->whichway=enum_rxdirection_forwards;
< 	
< 	while(intstack_depth(interpreter, rx->stack)) {
< 		(void)intstack_pop(interpreter, rx->stack);
< 	}
< 	
< 	rx->string=$2;
< 	
< 	(void)string_transcode(interpreter, rx->string, encoding_lookup("utf32"), rx->string->type, &rx->string);
< 	
< 	goto NEXT();
< }
< 
< ########################################
< 
< =item C<rx_freeinfo>(inout pmc)
< 
< Deallocates the info structure in the first parameter and nulls out the handle.
< 
< =cut
< 
< op rx_freeinfo(inout pmc) {
< 	mem_sys_free($1->data);
< 	$1->data=NULL;
< 	
< 	goto NEXT();
< }
< 
< ########################################
< 
< =item C<rx_cloneinfo>(inout pmc)
< 
< Clones the info structure in the first parameter.  Make sure to save the original 
< structure in another register, the stack, or a symbol table entry before calling this
< opcode.
< 
< This opcode actually creates a new stack for the new regular expression structure, but
< all other fields (including the group-related ones) stay the same.  It's primarily used
< for things like lookaheads and lookbehinds, where the regex's state should be completely
< restored to the original version if the match succeeds.  (Well, almost completely--groups
< matched with a cloned structure live on in the original.)
< 
< =cut
< 
< op rx_cloneinfo(inout pmc) {
< 	rxinfo *rx2;
< 	RX_dUNPACK($1);
< 	
< 	rx2=mem_sys_allocate(sizeof(rxinfo));
< 	*rx2=*rx;
< 	
< 	rx2->stack=intstack_new(interpreter);
< 
< 	$1=pmc_new(interpreter, enum_class_Pointer);
< 	$1->data=rx2;
< 	
< 	goto NEXT();
< }
302a190
> 
307c195
< =head3 Info accessor ops
---
> =head3 Stack manipulation ops
315,332c203
< =item C<rx_info_successful>(in pmc, out int)
< 
< If the info structure indicates the match was successful, sets the second parameter
< to true; otherwise sets it to false.
< 
< =cut
< 
< op rx_info_successful(in pmc, out int) {
< 	RX_dUNPACK($1);
< 
< 	$2=rx->success;
< 	
< 	goto NEXT();
< }
< 
< ########################################
< 
< =item C<rx_info_getindex>(in pmc, out int)
---
> =item C<rx_pushindex>(in pmc)
334,335c205
< Retrieves the current index stored in the info structure.  If the match has already
< finished successfully, this will be the index of the end of the match.
---
> Pushes the current index onto the stack contained in the info structure.
339,342c209,210
< op rx_info_getindex(in pmc, out int) {
< 	RX_dUNPACK($1);
< 	
< 	$2=rx->index;
---
> op rx_pushindex(out int) {
> 	intstack_push(interpreter, interpreter->ctx.intstack, $1);
347,362c215,216
< ########################################
< 
< =item C<rx_info_getstartindex>(in pmc, out int)
< 
< Gets the index the match started at.
< 
< Note that if a regex uses the C<rx_backwards(p)> op, the start and end indices may be
< reversed.
< 
< =cut
< 
< op rx_info_getstartindex(in pmc, out int) {
< 	RX_dUNPACK($1);
< 
< 	$2=rx->startindex;
< 	
---
> op rx_initstack() {
> 	interpreter->ctx.intstack = intstack_new(interpreter);
366,381c220,221
< ########################################
< 
< =item C<rx_info_getgroup>(in pmc, out int, out int, in int)
< 
< Gets the start and end indices of the group indicated by the fourth parameter.
< 
< =cut
< 
< op rx_info_getgroup(in pmc, out int, out int, in int) {
<         KEY key;
< 	RX_dUNPACK($1);
< 
<         MAKE_KEY(key, $4, enum_key_int, int_val);
< 	$2=rx->groupstart->vtable->get_integer_keyed(interpreter, rx->groupstart, &key);
< 	$3=rx->groupend->vtable->get_integer_keyed(interpreter, rx->groupend, &key);
< 	
---
> op rx_clearstack () {
> 	intstack_free(interpreter, interpreter->ctx.intstack);
384,410d223
< 
< ###############################################################################
< 
< =back
< 
< =head3 Stack manipulation ops
< 
< =over 4
< 
< =cut
< 
< ########################################
< 
< =item C<rx_pushindex>(in pmc)
< 
< Pushes the current index onto the stack contained in the info structure.
< 
< =cut
< 
< op rx_pushindex(in pmc) {
< 	RX_dUNPACK($1);
< 
< 	intstack_push(interpreter, rx->stack, rx->index);
< 	
< 	goto NEXT();
< }
< 
420,423c233,234
< op rx_pushmark(in pmc) {
< 	RX_dUNPACK($1);
< 
<  	intstack_push(interpreter, rx->stack, RX_MARK);
---
> op rx_pushmark() {
>  	intstack_push(interpreter, interpreter->ctx.intstack, RX_MARK);
430c241
< =item C<rx_popindex>(in pmc, inconst int)
---
> =item C<rx_popindex>(out int, inconst int)
437,438c248
< op rx_popindex(in pmc, inconst int) {
< 	RX_dUNPACK($1);
---
> op rx_popindex(out int, inconst int) {
441c251
< 	i=intstack_pop(interpreter, rx->stack);
---
> 	i=intstack_pop(interpreter, interpreter->ctx.intstack);
447c257
< 		rx->index=i;
---
> 		$1=i;
462,464d271
< ########################################
< 
< =item C<rx_forwards>(in pmc)
466c273
< Indicates that the regex should increment the index as it moves through the string.
---
> ###############################################################################
468c275
< =cut
---
> =back
470,471c277
< op rx_forwards(in pmc) {
< 	RX_dUNPACK($1);
---
> =head3 Matching ops
473,476c279
< 	rx->whichway=enum_rxdirection_forwards;
< 	
< 	goto NEXT();
< }
---
> =over 4
477a281
> =cut
481c285
< =item C<rx_backwards>(in pmc)
---
> =item C<rx_advance>(in str, inout int, inconst int)
483,485c287,288
< Indicates that the regex should decrement the index as it moves through the string.
< This is different from reversed regexes (see L</"rx_setprops(p, sc, ic)">); reversed
< affects the start index, while backwards affects the end index.
---
> Increments the start index one character.  Branches to the second parameter
> if it goes past the end of the string.
489,490c292,296
< op rx_backwards(in pmc) {
< 	RX_dUNPACK($1);
---
> op rx_advance(in str, inout int, inconst int) {
> 
> 	if ( (UINTVAL) $2++ > string_length($1)) {
> 		goto OFFSET($3);
> 	}
492,493d297
< 	rx->whichway=enum_rxdirection_backwards;
< 	
497c301
< ###############################################################################
---
> op rx_search(in str, inout int, inout int, in str, inconst int) {
499,507c303,306
< =back
< 
< =head3 Matching ops
< 
< =over 4
< 
< =cut
< 
< ########################################
---
> 	int literal_length, str_length, i, idx, start;
> 	
> 	str_length = string_length($1);
> 	literal_length = string_length($4);
509c308,309
< =item C<rx_advance>(in pmc, inconst int)
---
> 	i = 0;
> 	start = $3;
511,512c311,314
< Increments (or decrements, if the C<r> modifier is used) the start index one
< character.  Branches to the second parameter if it goes past the end of the string.
---
> 	/* Check if the string is long enough */
> 	if (start + literal_length > str_length) {
> 		goto OFFSET($5);
> 	}
514c316
< =cut
---
> 	while (i < literal_length) {
516,517c318,320
< op rx_advance(in pmc, inconst int) {
< 	RX_dUNPACK($1);
---
> 		if (string_index($1, start+i) != string_index($4, i)) {
> 			i = 0;
> 			start++;
519,521c322,325
< 	if(!RxReverse_test(rx)) {
< 		if(++rx->startindex + rx->minlength > string_length(rx->string)) {
< 			goto OFFSET($2);
---
> 			/* Check again */
> 			if (start + literal_length > str_length) {
> 				goto OFFSET($5);
> 			}
523,526c327,329
< 	}
< 	else {
< 		if(--rx->startindex < 0) {
< 			goto OFFSET($2);
---
> 		
> 		else {
> 			i++;
530,531c333,336
< 	rx->index=rx->startindex;
< 	
---
> 	$2 = start + literal_length;
>  	$3 = start; 
> 
> 	/*
534a340
> 	*/
539,541c345
< ########################################
< 
< =item C<rx_incrindex>(in pmc, in int)
---
> op rx_search_char (in str, inout int, inout int, in int, inconst int) {
543,551c347
< Increments the current index (or decrements, if C<rx_backwards> is used) by the 
< amount in the second parameter.  Does I<not> check if it's gone past the end of the
< string.
< 
< =cut
< 
< op rx_incrindex(in pmc, in int) {
< 	RX_dUNPACK($1);
< 	RxAdvanceX(rx, $2);
---
> 	int str_length, idx, start;
553,556c349
< 	goto NEXT();
< }
< 
< ########################################
---
> 	str_length = string_length($1);
558c351
< =item C<rx_setprops>(in pmc, in str, in int)
---
> 	start = $3;
560,565c353,356
< Sets certain properties in the info structure.  The second parameter is a string
< containing one or more of the following characters:
< 
< =over 4
< 
< =item C<i>
---
> 	/* Check if the string is long enough */
> 	if (start + 1 > str_length) {
> 		goto OFFSET($5);
> 	}
567c358
< Sets case-insensitive matching.
---
> 	for (idx = start; idx < str_length; idx++) {
569c360,362
< =item C<s>
---
> 		if (string_index($1, idx) == $4) {
> 			$2 = idx + 1;
> 			$3 = idx;
571c364,366
< Sets single-line matching; the C<rx_dot> op will match newlines with this turned on.
---
> 			goto NEXT();
> 		}
> 	}
573c368,369
< =item C<m>
---
> 	goto OFFSET($5);
> }
575,576d370
< Sets multiline matching; the C<rx_zwa_atbeginning> and C<rx_zwa_atend> opcodes will
< match the beginning and end of lines.
578c372
< =item C<r>
---
> ########################################
580,581c374
< Sets reverse or right matching; match starts at the end of the string and inches
< towards the beginning.
---
> =item C<rx_literal>(inout int, in str, in str, inconst int)
583c376,377
< =back
---
> Matches the exact string (sensitive to the C<i> modifier) passed in the second
> parameter.
585,586c379
< The third parameter is the minimum length the string would need to be for a match to
< be possible.  For example, in the match C</ba*r+/>, the minimum length is 2.
---
> B<XXX> Currently does not honor the C<i> modifier.
590c383
< op rx_setprops(in pmc, in str, in int) {
---
> op rx_literal(in str, inout int, in str, inconst int) {
592c385
< 	RX_dUNPACK($1);
---
> 	INTVAL new_idx = $2 + string_length($3);
594,622c387,388
< 	rx->minlength=$3;
< 
< 	for(i=0; i < string_length($2); i++) {
< 		switch((char)string_ord($2, (INTVAL)i)) {
< 			case 'i':
< 				RxCaseInsensitive_on(rx);
< 				
< 				if(!RxFlagTest(rx, enum_rxflags_is_copy)) {
< 					RxFlagOn(rx, enum_rxflags_is_copy);
< 					rx->string=string_copy(interpreter, rx->string);
< 				}
< 				
< 				/* string_lc(interpreter, rx->string); */
< 				
< 				break;
< 			case 's':
< 				RxSingleLine_on(rx);
< 				break;
< 			case 'm':
< 				RxMultiline_on(rx);
< 				break;
< 			case 'r':
< 				RxReverse_on(rx);
< 				rx->index=rx->startindex=string_length(rx->string) - rx->minlength - 1;
< 				break;
< 			default:
< 				fprintf(stderr, "Unknown regular expression option '%c'.", (char)string_ord($2, (INTVAL)i));
< 				HALT();
< 		}
---
> 	if( (INTVAL) string_length($1) < new_idx) {
> 		goto OFFSET($4);
624a391,398
> 	/* This is faster than using substr--it avoids memory allocations. */
> 	for(i=0; i < string_length($3); i++) {
> 		if(string_index($1, $2+(INTVAL)i) != string_index($3, (INTVAL)i)) {
> 			goto OFFSET($4);
> 		}
> 	}
> 
> 	$2 = new_idx;
628c402,404
< ########################################
---
> op rx_literal_all (in str, inout int, in str) {
> 	INTVAL i, idx;
> 	UINTVAL literal_length, str_length;
630c406,407
< =item C<rx_startgroup>(in pmc, in int)
---
> 	str_length = string_length($1);
> 	literal_length = string_length($3);
632,633c409,418
< Indicates that the current index is the start index of the group number indicated in
< the second parameter.
---
> 	idx = $2;
> 	while ( str_length > idx + literal_length ) {
> 	
> 		/* This is faster than using substr--it avoids memory allocations. */
> 		for(i=0; i < (INTVAL) literal_length; i++) {
> 			if(string_index($1, idx+i) != string_index($3, i)) {
> 					$2 = idx;
> 				goto NEXT();
> 			}
> 		}
635c420,421
< =cut
---
> 		idx = idx + literal_length;
> 	}
637,643c423
< op rx_startgroup(in pmc, in int) {
<         KEY key;
< 	RX_dUNPACK($1);
< 	
<         MAKE_KEY(key, $2, enum_key_int, int_val);
< 	rx->groupstart->vtable->set_integer_keyed(interpreter, rx->groupstart, &key, rx->index);
< 	
---
> 	$2 = idx;
647,649c427,428
< ########################################
< 
< =item C<rx_endgroup>(in pmc, in int)
---
> op rx_char(in str, inout int, in int, inconst int) {
> 	UINTVAL i;
651,652c430,436
< Indicates that the current index is the end index of the group number indicated in
< the second parameter.
---
> 	if( (INTVAL) string_length($1) <= $2) {
> 		goto OFFSET($4);
> 	}
> 	
> 	if(string_index($1, $2) != $3) {
> 		goto OFFSET($4);
> 	}
654c438,439
< =cut
---
> 	$2++;
> 	goto NEXT();
656,663d440
< op rx_endgroup(in pmc, in int) {
<     KEY key;
<     RX_dUNPACK($1);
< 	
<     MAKE_KEY(key, $2, enum_key_int, int_val);        
<     rx->groupend->vtable->set_integer_keyed(interpreter, rx->groupend, &key, rx->index);
< 	
<     goto NEXT();
666c443,445
< ########################################
---
> op rx_char_all (in str, inout int, in int) {
> 	UINTVAL idx;
> 	UINTVAL str_length;
668c447
< =item C<rx_literal>(in pmc, in str, inconst int)
---
> 	str_length = string_length($1);
670,671c449,454
< Matches the exact string (sensitive to the C<i> modifier) passed in the second
< parameter.
---
> 	for (idx = $2; idx < str_length; idx++) { 
> 		if(string_index($1, idx) != $3) {
> 			$2 = idx;
> 			goto NEXT();
> 		}
> 	}
673c456,458
< B<XXX> Currently does not honor the C<i> modifier.
---
> 	$2 = idx;
> 	goto NEXT();
> }
675c460
< =cut
---
> op rx_oneof_bmp_all (in str, inout int, in pmc) {
677,679c462,464
< op rx_literal(in pmc, in str, inconst int) {
< 	UINTVAL i;
< 	RX_dUNPACK($1);
---
> 	Bitmap bmp;
> 	UINTVAL idx;
> 	UINTVAL str_length;
681,688c466,473
< 	if(string_length(rx->string) < rx->index+string_length($2)) {
< 		goto OFFSET($3);
< 	}
< 	
< 	/* This is faster than using substr--it avoids memory allocations. */
< 	for(i=0; i < string_length($2); i++) {
< 		if(string_ord(rx->string, rx->index+(INTVAL)i) != string_ord($2, (INTVAL)i)) {
< 			goto OFFSET($3);
---
> 	str_length = string_length($1);
> 	idx = $2;
> 	bmp = $3->data;
> 
> 	while (idx < str_length) { 
> 		if(! bitmap_match(bmp, string_index($1,idx) ) ) { 
> 			$2 = idx;
> 			goto NEXT();
689a475
> 		idx++;
692c478
< 	RxAdvanceX(rx, string_length($2));
---
> 	$2 = idx;
698c484
< =item C<rx_is_w>(in pmc, inconst int)
---
> =item C<rx_is_w>(in str, inout int, inconst int)
704,705c490
< op rx_is_w(in pmc, inconst int) {
< 	RX_dUNPACK($1);
---
> op rx_is_w(in str, inout int, inconst int) {
707c492
< 	RxAssertMore(rx, $2);
---
> 	RxAssertMore($1, $2, $3);
709,710c494,495
< 	if(rx_is_word_character(interpreter, RxCurChar(rx))) {
< 		RxAdvance(rx);
---
> 	if(rx_is_word_character(interpreter, string_index($1, $2)) ) {
> 		$2++;
714c499
< 		goto OFFSET($2);
---
> 		goto OFFSET($3);
721c506
< =item C<rx_is_d>(in pmc, inconst int)
---
> =item C<rx_is_d>(in str, inout int, inconst int)
727,728c512
< op rx_is_d(in pmc, inconst int) {
< 	RX_dUNPACK($1);
---
> op rx_is_d(in str, inout int, inconst int) {
730c514
< 	RxAssertMore(rx, $2);
---
> 	RxAssertMore($1, $2, $3);
732,733c516,517
< 	if(rx_is_number_character(interpreter, RxCurChar(rx))) {
< 		RxAdvance(rx);
---
> 	if(rx_is_number_character(interpreter, string_index($1,$2)) ) {
> 		$2++;
737c521
< 		goto OFFSET($2);
---
> 		goto OFFSET($3);
743c527
< =item C<rx_is_s>(in pmc, inconst int)
---
> =item C<rx_is_s>(in str, inout int, inconst int)
749,750c533
< op rx_is_s(in pmc, inconst int) {
< 	RX_dUNPACK($1);
---
> op rx_is_s(in str, inout int, inconst int) {
752c535
< 	RxAssertMore(rx, $2);
---
> 	RxAssertMore($1, $2, $3);
754,755c537,538
< 	if(rx_is_whitespace_character(interpreter, RxCurChar(rx))) {
< 		RxAdvance(rx);
---
> 	if(rx_is_whitespace_character(interpreter, string_index($1,$2)) ) {
> 		$2++;
759c542
< 		goto OFFSET($2);
---
> 		goto OFFSET($3);
766c549
< =item C<rx_oneof>(in pmc, in str, inconst int)
---
> =item C<rx_oneof>(in str, inout int, in pmc, inconst int)
774,775d556
< B<XXX> Currently does not honor the C<i> modifier.
< 
778,779c559,560
< op rx_oneof(in pmc, in str, inconst int) {
< 	RX_dUNPACK($1);
---
> op rx_oneof(in str, inout int, in str, inconst int) {
> 
782c563
< 	RxAssertMore(rx, $3);
---
> 	RxAssertMore($1, $2, $4);
784c565
< 	bitmap=bitmap_make(interpreter, $2);	
---
> 	bitmap=bitmap_make(interpreter, $3);	
786c567
< 	if(bitmap_match(bitmap, RxCurChar(rx))) {
---
> 	if(bitmap_match(bitmap, string_index($1, $2)) ) {
789c570
< 		RxAdvance(rx);
---
> 		$2++;
794c575
< 		goto OFFSET($3);
---
> 		goto OFFSET($4);
801c582
< =item C<rx_oneof_bmp>(in pmc, in pmc, inconst int)
---
> =item C<rx_oneof_bmp>(in str, inout int, in pmc, inconst int)
808,809c589
< op rx_oneof_bmp(in pmc, in pmc, inconst int) {
< 	RX_dUNPACK($1);
---
> op rx_oneof_bmp(in str, inout int, in pmc, inconst int) {
811c591,593
< 	RxAssertMore(rx, $3);
---
> 	if ( (INTVAL)string_length($1) < $2 + 1) {
> 		goto OFFSET($4);
> 	}
813,814c595,597
< 	if(bitmap_match($2->data, RxCurChar(rx))) {
< 		RxAdvance(rx);
---
> 	
> 	if(bitmap_match($3->data, string_index($1,$2) )) {  /* <-------- */
> 		$2++;
818c601
< 		goto OFFSET($3);
---
> 		goto OFFSET($4);
844c627
< =item C<rx_dot>(in pmc, inconst int)
---
> =item C<rx_dot>(in str, inout int, inconst int)
851,868c634,638
< op rx_dot(in pmc, inconst int) {
< 	RX_dUNPACK($1);
< 
< 	RxAssertMore(rx, $2);
< 
< 	if(RxSingleLine_test(rx)) {
< 		RxAdvance(rx);
< 		goto NEXT();
< 	}
< 	else {
< 		if(!rx_is_newline(interpreter, RxCurChar(rx))) {
< 			RxAdvance(rx);
< 			goto NEXT();
< 		}
< 		else {
< 			goto OFFSET($2);
< 		}
< 	}
---
> op rx_dot(in str, inout int, inconst int) {
> 	RxAssertMore($1, $2, $3);
> 	$2++;
> 	goto NEXT();
> 	
873c643
< =item C<rx_zwa_boundary>(in pmc, inconst int)
---
> =item C<rx_zwa_boundary>(in str, in int, inconst int)
880,881c650
< op rx_zwa_boundary(in pmc, inconst int) {
< 	RX_dUNPACK($1);
---
> op rx_zwa_boundary(in str, in int, inconst int) {
884,887c653,654
< 	one=rx_is_word_character(interpreter, RxCurChar(rx));
< 	RxAdvanceX(rx, -1);
< 	two=rx_is_word_character(interpreter, RxCurChar(rx));
< 	RxAdvance(rx);
---
> 	one=rx_is_word_character(interpreter, string_index($1,$2));
> 	two=rx_is_word_character(interpreter, string_index($1, $2 - 1));
890,918c657
< 		goto OFFSET($2);
< 	}
< 	
< 	goto NEXT();
< }
< 
< ########################################
< 
< =item C<rx_zwa_atbeginning>(in pmc, inconst int)
< 
< Matches at the beginning of the string.  If the C<m> modifier is used, matches at the
< beginning of any line.
< 
< B<XXX> Currently does not honor the C<m> modifier.
< 
< =cut
< 
< op rx_zwa_atbeginning(in pmc, inconst int) {
< 	RX_dUNPACK($1);
< 	
< 	if(RxMultiline_test(rx)) {
< 		if(!rx_is_newline(interpreter, string_ord(rx->string, rx->index-1))) {
< 			goto OFFSET($2);
< 		}
< 	}
< 	else {
< 		if(rx->index != 0) {
< 			goto OFFSET($2);
< 		}
---
> 		goto OFFSET($3);
926c665
< =item C<rx_zwa_atend>(in pmc, inconst int)
---
> =item C<rx_zwa_atend>(in str, in int, inconst int)
928,929c667
< Matches at the end of the string.  If the C<m> modifier is used, matches at the
< end of any line.
---
> Matches at the end of the string.  
931d668
< B<XXX> Currently does not honor the C<m> modifier.
935,936c672
< op rx_zwa_atend(in pmc, inconst int) {
< 	RX_dUNPACK($1);
---
> op rx_zwa_atend(in str, in int, inconst int) {
938,946c674,675
< 	if(RxMultiline_test(rx)) {
< 		if(!rx_is_newline(interpreter, RxCurChar(rx))) {
< 			goto OFFSET($2);
< 		}
< 	}
< 	else {
< 		if((UINTVAL)rx->index != string_length(rx->string)) {
< 			goto OFFSET($2);
< 		}
---
> 	if((UINTVAL)$2 != string_length($1)) {
> 		goto OFFSET($3);
952,982d680
< ########################################
< 
< =item C<rx_succeed>(in pmc)
< 
< Modifies the info structure to indicate that the match succeeded.
< 
< =cut
< 
< op rx_succeed(in pmc) {
< 	RX_dUNPACK($1);
< 
< 	rx->success=1;
< 	
< 	goto NEXT();
< }
< 
< ########################################
< 
< =item C<rx_fail>(in int)
< 
< Modifies the info structure to indicate that the match failed.
< 
< =cut
< 
< op rx_fail(in pmc) {
< 	RX_dUNPACK($1);
< 
< 	rx->success=0;
< 	
< 	goto NEXT();
< }
990a689,690
> 
> XXX: this tutorial refers to the old regex design; it needs to be updated for the new one.
Index: rxstacks.c
===================================================================
RCS file: /cvs/public/parrot/rxstacks.c,v
retrieving revision 1.7
diff -r1.7 rxstacks.c
103a104,115
> 
> void intstack_free (struct Parrot_Interp *interpreter, IntStack stack)
> {
>     IntStack chunk, temp;
> 
>     for (chunk = stack->next; chunk != stack; chunk = temp) {
>         temp = chunk->next;        
>         mem_sys_free(chunk);
>     }
> 
>     mem_sys_free(stack);
> }   
Index: string.c
===================================================================
RCS file: /cvs/public/parrot/string.c,v
retrieving revision 1.83
diff -r1.83 string.c
171c171
< INTVAL
---
> inline INTVAL
174c174,182
<     return s->encoding->decode(s->encoding->skip_forward(s->bufstart, idx));
---
>     if (s->encoding->index == enum_encoding_singlebyte) {
>         /* This inlines the computation for the case where the string is 
>          * in a singlebyte encoding. 
>          * Whether this code is correct for every chartype, I don't know. */
>         return *((unsigned char*) s->bufstart + idx);
>     }
>     else {
>         return s->encoding->decode(s->encoding->skip_forward(s->bufstart, idx));
>     }
176a185,191
> 
> inline INTVAL string_index_native(const STRING *s, UINTVAL idx) 
> {
>     return * ((unsigned char*)s->bufstart + idx);
> }   
> 
> 
266c281,283
<         dest = string_copy(interpreter, src);
---
>         /*dest = string_copy(interpreter, src);*/
> 
>         dest = src;
Index: include/parrot/pmc.h
===================================================================
RCS file: /cvs/public/parrot/include/parrot/pmc.h,v
retrieving revision 1.32
diff -r1.32 pmc.h
4c4
<  *     $Id: pmc.h,v 1.32 2002/07/18 04:30:42 mongo Exp $
---
>  *     $Id: pmc.h,v 1.31 2002/07/04 18:31:20 mrjoltcola Exp $
29a30
>     enum_class_PerlReInfo,
113c114,117
<     PMC_constant_FLAG = 1 << 18
---
>     PMC_constant_FLAG = 1 << 18,
>     /* Immunity flag, for ensuring a PMC survives DOD. Used internally
>      * by the GC: should not be used in PMC code. */
>     PMC_immune_FLAG = 1 << 19
Index: include/parrot/rx.h
===================================================================
RCS file: /cvs/public/parrot/include/parrot/rx.h,v
retrieving revision 1.20
diff -r1.20 rx.h
27,42d26
< typedef enum rxflags {
<     enum_rxflags_none = 0,
<     enum_rxflags_case_insensitive = 1,
<     enum_rxflags_single_line = 2,
<     enum_rxflags_multiline = 4,
<     enum_rxflags_reverse = 8,
< 
< 
<     enum_rxflags_is_copy = 128
< } rxflags;
< 
< typedef enum rxdirection {
<     enum_rxdirection_forwards = 1,
<     enum_rxdirection_backwards = -1
< } rxdirection;
< 
51,71d34
< typedef struct rxinfo {
<     STRING *string;
<     INTVAL index;
<     INTVAL startindex;
<     INTVAL success;
< 
<     rxflags flags;
<     UINTVAL minlength;
<     rxdirection whichway;
< 
<     PMC *groupstart;
<     PMC *groupend;
< 
<     opcode_t *substfunc;
< 
<     IntStack stack;
< } rxinfo;
< 
< 
< rxinfo *rx_allocate_info(struct Parrot_Interp *, STRING *);
< 
82,111d44
< 
< #define RX_dUNPACK(pmc)            rxinfo *rx=(rxinfo *)(pmc)->data
< #define RxCurChar(rx)              ((char)string_ord((rx)->string, \
<                                                      (rx)->index))
< 
< #define RxAdvance(rx)              RxAdvanceX((rx), 1)
< #define RxAdvanceX(rx, x)          ((rx)->index += (x) * (rx)->whichway)
< 
< #define RxCaseInsensitive_on(rx)   RxFlagOn(rx, enum_rxflags_case_insensitive)
< #define RxCaseInsensitive_off(rx)  RxFlagOff(rx, enum_rxflags_case_insensitive)
< #define RxCaseInsensitive_test(rx) RxFlagTest(rx, \
<                                               enum_rxflags_case_insensitive)
< 
< #define RxSingleLine_on(rx)        RxFlagOn(rx, enum_rxflags_single_line)
< #define RxSingleLine_off(rx)       RxFlagOff(rx, enum_rxflags_single_line)
< #define RxSingleLine_test(rx)      RxFlagTest(rx, enum_rxflags_single_line)
< 
< #define RxMultiline_on(rx)         RxFlagOn(rx, enum_rxflags_multiline)
< #define RxMultiline_off(rx)        RxFlagOff(rx, enum_rxflags_multiline)
< #define RxMultiline_test(rx)       RxFlagTest(rx, enum_rxflags_multiline)
< 
< #define RxReverse_on(rx)           RxFlagOn(rx, enum_rxflags_reverse)
< #define RxReverse_off(rx)          RxFlagOff(rx, enum_rxflags_reverse)
< #define RxReverse_test(rx)         RxFlagTest(rx, enum_rxflags_reverse)
< 
< #define RxFlagOn(rx, flag)         ((rx)->flags |=  (flag))
< #define RxFlagOff(rx, flag)        ((rx)->flags &= ~(flag))
< #define RxFlagTest(rx, flag)       ((rx)->flags  &  (flag))
< 
< #define RxFlagsOff(rx)             ((rx)->flags = enum_rxflags_none)
Index: include/parrot/rxstacks.h
===================================================================
RCS file: /cvs/public/parrot/include/parrot/rxstacks.h,v
retrieving revision 1.3
diff -r1.3 rxstacks.h
40a41,42
> void intstack_free(struct Parrot_Interp *, IntStack);
> 
Index: t/op/rx.t
===================================================================
RCS file: /cvs/public/parrot/t/op/rx.t,v
retrieving revision 1.10
diff -r1.10 rx.t
1c1
< use Parrot::Test tests => 27;
---
> use Parrot::Test tests => 23;
5,6d4
< 	$_[2] ||= "";
< 	$_[3] ||= 0;
10,14c8,32
< 		rx_allocinfo P0, S0
< 		bsr RX_0
< 		rx_info_successful P0, I0
< 		if I0, YUP
< 		print "no match\\n"
---
> 		set I0, 0
> 
> 	START:	
> 		$_[1]
> 		
> 	SUCCEED:
> 		length I2, S0
> 
> 		substr S1, S0,  0, I0
> 		sub I4, I1, I0
> 		substr S2, S0, I0, I4
> 		sub I4, I2, I1
> 		substr S3, S0, I1, I4	
> 
> 		print "<"
> 		print S1
> 		print "><"
> 		print S2
> 		print "><"
> 		print S3
> 		print ">\\n"
> 		
> 		end
> 
> 	FAIL:	print "no match\\n"
16,19c34,48
< 	YUP:
< 		rx_info_getstartindex P0, I1
< 		rx_info_getindex P0, I2
< 		length I3, S0
---
> 
> END
> }
> 
> sub gentest_advance($$;$$) {
> 
> 	# inserts the generic advance mechanism
> 
> 	return <<"END";
> 		set S0, "$_[0]"
> 		set I0, 0
> 
> 	START:  
> 		set I1, I0
> 		$_[1]
21c50,51
< 		rx_freeinfo P0
---
> 	SUCCEED:
> 		length I2, S0
23c53,55
< 		substr S1, S0,  0, I1
---
> 		substr S1, S0,  0, I0
> 		sub I4, I1, I0
> 		substr S2, S0, I0, I4
25,27c57
< 		substr S2, S0, I1, I4
< 		sub I4, I3, I2
< 		substr S3, S0, I2, I4	
---
> 		substr S3, S0, I1, I4	
39,41c69,71
< 	RX_0:
< 		rx_setprops P0, "$_[2]", $_[3]
< 		branch START
---
> 	FAIL:	print "no match\\n"
> 		end
> 
43,45c73,74
< 		rx_advance P0, FAIL
< 	START:
< 		$_[1]
---
> 		rx_advance S0, I0, FAIL
> 		branch START
47,51d75
< 		rx_succeed P0
< 		ret
< 	FAIL:
< 		rx_fail P0
< 		ret
54a79
> 
56c81
< 		rx_literal P0, "a", ADVANCE
---
> 		rx_search S0, I1, I0, "a", FAIL
62c87
< 		rx_literal P0, "a", ADVANCE
---
> 		rx_search S0, I1, I0, "a", FAIL
68c93
< 		rx_literal P0, "aa", ADVANCE
---
> 		rx_search S0, I1, I0, "aa", FAIL
74c99
< 		rx_literal P0, "a", ADVANCE
---
> 		rx_search S0, I1, I0, "a", FAIL
79,80c104,105
< output_is(gentest('a', <<'CODE'), <<'OUTPUT', 'character classes (successful)');
< 		rx_oneof P0, "aeiou", ADVANCE
---
> output_is(gentest_advance('a', <<'CODE'), <<'OUTPUT', 'character classes (successful)');
> 		rx_oneof S0, I1, "aeiou", ADVANCE
85,86c110,111
< output_is(gentest('b', <<'CODE'), <<'OUTPUT', 'character classes (failure)');
< 		rx_oneof P0, "aeiou", ADVANCE
---
> output_is(gentest_advance('b', <<'CODE'), <<'OUTPUT', 'character classes (failure)');
> 		rx_oneof S0, I1, "aeiou", ADVANCE
91,95c116
< output_is(gentest('a', <<'CODE'), <<'OUTPUT', 'dot (success)');
< 		rx_dot P0, ADVANCE
< CODE
< <><a><>
< OUTPUT
---
> output_is(gentest_advance('a', <<'CODE'), <<'OUTPUT', 'dot (success)');
97,98c118
< output_is(gentest('\n', <<'CODE'), <<'OUTPUT', 'dot (failure)');
< 		rx_dot P0, ADVANCE
---
> 		rx_dot S0, I1, ADVANCE
100c120
< no match
---
> <><a><>
103,107c123,127
< output_is(gentest('aA9_', <<'CODE'), <<'OUTPUT', '\w (success)');
< 		rx_is_w P0, ADVANCE
< 		rx_is_w P0, ADVANCE
< 		rx_is_w P0, ADVANCE
< 		rx_is_w P0, ADVANCE
---
> output_is(gentest_advance('aA9_', <<'CODE'), <<'OUTPUT', '\w (success)');
> 		rx_is_w S0, I1, ADVANCE
> 		rx_is_w S0, I1, ADVANCE
> 		rx_is_w S0, I1, ADVANCE
> 		rx_is_w S0, I1, ADVANCE
112,113c132,133
< output_is(gentest('?', <<'CODE'), <<'OUTPUT', '\w (failure)');
< 		rx_is_w P0, ADVANCE
---
> output_is(gentest_advance('?', <<'CODE'), <<'OUTPUT', '\w (failure)');
> 		rx_is_w S0, I1, ADVANCE
118,128c138,148
< output_is(gentest('0123456789', <<'CODE'), <<'OUTPUT', '\d (success)');
< 		rx_is_d P0, ADVANCE
< 		rx_is_d P0, ADVANCE
< 		rx_is_d P0, ADVANCE
< 		rx_is_d P0, ADVANCE
< 		rx_is_d P0, ADVANCE
< 		rx_is_d P0, ADVANCE
< 		rx_is_d P0, ADVANCE
< 		rx_is_d P0, ADVANCE
< 		rx_is_d P0, ADVANCE
< 		rx_is_d P0, ADVANCE
---
> output_is(gentest_advance('0123456789', <<'CODE'), <<'OUTPUT', '\d (success)');
> 		rx_is_d S0, I1, ADVANCE
> 		rx_is_d S0, I1, ADVANCE
> 		rx_is_d S0, I1, ADVANCE
> 		rx_is_d S0, I1, ADVANCE
> 		rx_is_d S0, I1, ADVANCE
> 		rx_is_d S0, I1, ADVANCE
> 		rx_is_d S0, I1, ADVANCE
> 		rx_is_d S0, I1, ADVANCE
> 		rx_is_d S0, I1, ADVANCE
> 		rx_is_d S0, I1, ADVANCE
133,136c153,156
< output_is(gentest('@?#', <<'CODE'), <<'OUTPUT', '\d (failure)');
< 		rx_is_d P0, ADVANCE
< 		rx_is_d P0, ADVANCE
< 		rx_is_d P0, ADVANCE
---
> output_is(gentest_advance('@?#', <<'CODE'), <<'OUTPUT', '\d (failure)');
> 		rx_is_d S0, I1, ADVANCE
> 		rx_is_d S0, I1, ADVANCE
> 		rx_is_d S0, I1, ADVANCE
141,142c161,162
< output_is(gentest(' ', <<'CODE'), <<'OUTPUT', '\s (success)');
< 		rx_is_s P0, ADVANCE
---
> output_is(gentest_advance(' ', <<'CODE'), <<'OUTPUT', '\s (success)');
> 		rx_is_s S0, I1, ADVANCE
147,148c167,168
< output_is(gentest('a', <<'CODE'), <<'OUTPUT', '\s (failure)');
< 		rx_is_s P0, ADVANCE
---
> output_is(gentest_advance('a', <<'CODE'), <<'OUTPUT', '\s (failure)');
> 		rx_is_s S0, I1, ADVANCE
153,156c173,177
< output_is(gentest('a', <<'CODE'), <<'OUTPUT', 'stack (pushindex/popindex)');
< 		rx_pushindex P0
< 		rx_literal P0, "a", ADVANCE
< 		rx_popindex P0, ADVANCE
---
> output_is(gentest_advance('a', <<'CODE'), <<'OUTPUT', 'stack (pushindex/popindex)');
> 		rx_initstack
> 		rx_pushindex I1
> 		rx_literal S0, I1, "a", ADVANCE
> 		rx_popindex I1, ADVANCE
161,166c182,187
< output_is(gentest('a', <<'CODE'), <<'OUTPUT', 'stack (pushmark)');
< 		rx_pushmark P0
< 		rx_pushindex P0
< 		rx_literal P0, "a", ADVANCE
< 		rx_popindex P0, ADVANCE
< 		rx_popindex P0, ADVANCE
---
> output_is(gentest_advance('a', <<'CODE'), <<'OUTPUT', 'stack (pushmark)');
> 		rx_pushmark I1
> 		rx_pushindex I1
> 		rx_literal S0, I1, "a", ADVANCE
> 		rx_popindex I1, ADVANCE
> 		rx_popindex I1, ADVANCE
171,174c192,195
< output_is(gentest('a', <<'CODE'), <<'OUTPUT', 'groups');
< 		rx_startgroup P0, 0
< 		rx_literal P0, "a", ADVANCE
< 		rx_endgroup P0, 0
---
> output_is(gentest_advance('a', <<'CODE'), <<'OUTPUT', 'groups');
> 		set I2, I1
> 		rx_literal S0, I1, "a", ADVANCE
> 		set I3, I1
176,178c197,198
< 		rx_info_getgroup P0, I1, I2, 0
< 		sub I2, I2, I1
< 		substr S1, S0, I1, I2
---
> 		sub I3, I3, I2
> 		substr S1, S0, I2, I3
187,189c207,208
< output_is(gentest('a', <<'CODE'), <<'OUTPUT', 'ZWA: ^ (success)');
< 		rx_zwa_atbeginning P0, ADVANCE
< 		rx_literal P0, "a", ADVANCE
---
> output_is(gentest_advance('a', <<'CODE'), <<'OUTPUT', 'ZWA: ^ (success)');
> 		rx_literal S0, I1, "a", ADVANCE
194,196c213,214
< output_is(gentest('b', <<'CODE'), <<'OUTPUT', 'ZWA: ^ (failure)');
< 		rx_zwa_atbeginning P0, ADVANCE
< 		rx_literal P0, "a", ADVANCE
---
> output_is(gentest_advance('b', <<'CODE'), <<'OUTPUT', 'ZWA: ^ (failure)');
> 		rx_literal S0, I1, "a", ADVANCE
201,203c219,221
< output_is(gentest('a', <<'CODE'), <<'OUTPUT', 'ZWA: $ (success)');
< 		rx_literal P0, "a", ADVANCE
< 		rx_zwa_atend P0, ADVANCE
---
> output_is(gentest_advance('a', <<'CODE'), <<'OUTPUT', 'ZWA: $ (success)');
> 		rx_literal S0, I1, "a", ADVANCE
> 		rx_zwa_atend S0, I1, ADVANCE
208,210c226,228
< output_is(gentest('ab', <<'CODE'), <<'OUTPUT', 'ZWA: $ (failure)');
< 		rx_literal P0, "a", ADVANCE
< 		rx_zwa_atend P0, ADVANCE
---
> output_is(gentest_advance('ab', <<'CODE'), <<'OUTPUT', 'ZWA: $ (failure)');
> 		rx_literal S0, I1, "a", ADVANCE
> 		rx_zwa_atend S0, I1, ADVANCE
215,217c233,235
< output_is(gentest('a?', <<'CODE'), <<'OUTPUT', 'ZWA: \b (success)');
< 		rx_literal P0, "a", ADVANCE
< 		rx_zwa_boundary P0, ADVANCE
---
> output_is(gentest_advance('a?', <<'CODE'), <<'OUTPUT', 'ZWA: \b (success)');
> 		rx_literal S0, I1, "a", ADVANCE
> 		rx_zwa_boundary S0, I1, ADVANCE
222,224c240,242
< output_is(gentest('ab', <<'CODE'), <<'OUTPUT', 'ZWA: \b (failure)');
< 		rx_literal P0, "a", ADVANCE
< 		rx_zwa_boundary P0, ADVANCE
---
> output_is(gentest_advance('ab', <<'CODE'), <<'OUTPUT', 'ZWA: \b (failure)');
> 		rx_literal S0, I1, "a", ADVANCE
> 		rx_zwa_boundary S0, I1, ADVANCE		
229,252d246
< 
< output_is(gentest('ba', <<'CODE', 'r'), <<'OUTPUT', 'reversed regexen (/r)');
< 		rx_dot P0, ADVANCE
< CODE
< <b><a><>
< OUTPUT
< 
< output_is(gentest('\n', <<'CODE', 's'), <<'OUTPUT', 'single-line regexen (/s)');
< 		rx_dot P0, ADVANCE
< CODE
< <><
< ><>
< OUTPUT
< 
< output_is(gentest('\n\n', <<'CODE', 'm'), <<'OUTPUT', 'multiline regexen (/m)');
< 		rx_literal P0, "\n", ADVANCE
< 		rx_zwa_atbeginning P0, ADVANCE
< 		rx_zwa_atend P0, ADVANCE
< CODE
< <><
< ><
< >
< OUTPUT
< 
255c249
< 	output_is(gentest('HeLlO', <<'CODE', 'i'), <<'OUTPUT', 'case-insensitive regexen (/i)');
---
> 	output_is(gentest_advance('HeLlO', <<'CODE', 'i'), <<'OUTPUT', 'case-insensitive regexen (/i)');
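
For readers following the test changes above: the rewritten SUCCEED block no longer asks a regex-info PMC for the match bounds; it slices the subject string directly from integer registers. The sketch below models that slicing in Python (it is not Parrot code). It assumes, from reading the diff, that I0 holds the match start and I1 the index just past the match; the function names `split_match` and `report` are hypothetical and exist only for this illustration.

```python
def split_match(subject, start, end):
    """Model of the SUCCEED block: return (prefix, match, suffix).

    Assumption from the diff: start plays the role of I0 (where the
    match began) and end plays the role of I1 (index after the match).
    """
    total = len(subject)                        # length I2, S0
    prefix = subject[0:start]                   # substr S1, S0, 0, I0
    match_len = end - start                     # sub I4, I1, I0
    match = subject[start:start + match_len]    # substr S2, S0, I0, I4
    rest_len = total - end                      # sub I4, I2, I1
    suffix = subject[end:end + rest_len]        # substr S3, S0, I1, I4
    return prefix, match, suffix


def report(subject, start, end):
    """Format the three pieces the way the tests print them."""
    return "<%s><%s><%s>" % split_match(subject, start, end)


if __name__ == "__main__":
    # Matching "a" against the subject "a" from index 0 to 1.
    print(report("a", 0, 1))
```

With subject "a", start 0 and end 1 this yields `<><a><>`, the output the 'dot (success)' test expects.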

examples_regex.tar.gz
