[PATCH] rx.dev

Stephen Rawls Sun, 14 Jul 2002 09:18:39 -0700

Hello everyone,
  I've only been involved with parrot since last week,
but I've been learning quickly from all the
documentation.  With the recent activity about lack of
documentation, I thought I'd try to help out as best I
could.  I've attached a file for an rx.dev candidate. 
Some parts may be wrong, and at some points I even ask
questions, but there isn't much to rx.[ch], so overall
it should be a decent rough draft.  Below are my
questions, copied right out of the attached document:


1) The rx_is_number_character function breaks the
abstraction and uses the following expression to test
the argument:
if (ch >= '0' && ch <= '9')
It explains that it is "faster to do
less-than/greater-than."
My question is: Doesn't this restrict the ability to
add different character encodings and language
support?  What about languages that don't use Arabic
numerals?

2) In the rxinfo struct:
opcode_t *substfunc;

My first guess was that that is a pointer to the first
opcode the regex uses, but then I got confused by the
name 'substfunc.'  So basically ... what is it used
for?

Thanks!
Stephen Rawls



rx.c / rx.h

rx.c and rx.h set up functions to be used by the regular expression engine.  They also 
define internal helper functions that add a layer of abstraction to the rx_is_x family 
of functions.  Please also see rx.ops, rxstacks.c, and rxstacks.h.

rx_allocate_info

Initializes a regular expression object and allocates the memory.

rx_is_word_character
rx_is_number_character
rx_is_whitespace_character
rx_is_newline

These functions check if the character passed as an argument is a word_character, 
number_character, whitespace_character, or a newline, respectively.  They each use 
bitmaps to add a layer of abstraction.  In this case a bitmap is simply a collection 
of characters.  Instead of manually looking at a string, these functions create a 
bitmap of allowable characters (using predefined constants, like RX_WORDCHARS), and 
call the function C<bitmap_match>, which checks if the supplied character is in the 
bitmap.
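
As a rough illustration of the pattern described above (build a bitmap from a constant 
list of characters, then ask whether a character is in it), here is a minimal sketch.  
Everything in it is an assumption made for illustration: DEMO_WORDCHARS, class_bitmap, 
class_bitmap_from_cstr, class_bitmap_match, and demo_is_word_character are hypothetical 
names, not the ones in rx.c, and the real code works with Parrot's own types.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for a constant such as RX_WORDCHARS. */
#define DEMO_WORDCHARS \
    "abcdefghijklmnopqrstuvwxyz" "ABCDEFGHIJKLMNOPQRSTUVWXYZ" "0123456789_"

/* Hypothetical bitmap type: one bit for each possible one-byte character. */
typedef struct {
    unsigned char bits[32];
} class_bitmap;

/* Build a bitmap containing every character of a NUL-terminated string. */
static void class_bitmap_from_cstr(class_bitmap *bmp, const char *chars) {
    memset(bmp->bits, 0, sizeof bmp->bits);
    for (; *chars != '\0'; chars++) {
        unsigned char ch = (unsigned char)*chars;
        bmp->bits[ch >> 3] |= (unsigned char)(1u << (ch & 7));
    }
}

/* Return 1 if ch is in the bitmap, 0 otherwise. */
static int class_bitmap_match(const class_bitmap *bmp, unsigned char ch) {
    return (bmp->bits[ch >> 3] & (1u << (ch & 7))) != 0;
}

/* The character test itself only says "which set" and "which character". */
static int demo_is_word_character(unsigned char ch) {
    class_bitmap bmp;
    class_bitmap_from_cstr(&bmp, DEMO_WORDCHARS);
    return class_bitmap_match(&bmp, ch);
}

/* Small usage example. */
static void demo_word_selfcheck(void) {
    assert(demo_is_word_character('x'));
    assert(demo_is_word_character('_'));
    assert(!demo_is_word_character(' '));
}
```

The point of the indirection is that the test function never inspects the character 
list itself, so changing the set of allowed characters only means changing the constant.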


NOTE: The rx_is_number_character function breaks the abstraction and uses the 
following expression to test the argument:
if (ch >= '0' && ch <= '9')
It explains that it is "faster to do less-than/greater-than."
My question is: Doesn't this restrict the ability to add different character 
encodings and language support?  What about languages that don't use Arabic numerals?

bitmap_make

This function makes a bitmap from its argument (of type STRING*).  Let us examine two 
cases: a character that fits in one byte, and a character that needs more than one.

One byte
First of all, (255 >> 3) = 31.  The code uses this for a little efficiency in 
storage/speed.  An internal array is created with 32 elements (each byte-sized).  If 
you take the input character and right shift it by 3, you will get a number between 0 
and 31; exactly 8 of the numbers between 0 and 255 map to each number between 0 and 
31.  Each element in this array is therefore a bitfield, with a 1 or 0 in each bit to 
indicate whether a particular character is in the bitmap or not.  So, (ch >> 3) takes 
us to the right element in the array for ch, but how do we get to the right bit in 
that bitfield?  The code is 1 << (ch & 7).  This gives a unique power of two (a 
single set bit) for each of the 8 characters that map to that particular bitfield.
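
The index arithmetic above can be tried out in isolation.  The following is only a 
sketch under assumptions: demo_bitmap and the demo_bitmap_* functions are hypothetical 
names, not the actual declarations from rx.h, and only one-byte characters are handled.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical type: 32 bytes * 8 bits = 256 possible one-byte characters. */
typedef struct {
    unsigned char bits[32];
} demo_bitmap;

/* Start with an empty set. */
static void demo_bitmap_clear(demo_bitmap *bmp) {
    memset(bmp->bits, 0, sizeof bmp->bits);
}

/* (ch >> 3) picks the array element, 1 << (ch & 7) picks the bit inside it. */
static void demo_bitmap_add(demo_bitmap *bmp, unsigned char ch) {
    bmp->bits[ch >> 3] |= (unsigned char)(1u << (ch & 7));
}

/* Return 1 if the bit for ch is set, 0 otherwise. */
static int demo_bitmap_match(const demo_bitmap *bmp, unsigned char ch) {
    return (bmp->bits[ch >> 3] & (1u << (ch & 7))) != 0;
}

/* Small usage example. */
static void demo_bitmap_selfcheck(void) {
    demo_bitmap bmp;
    demo_bitmap_clear(&bmp);
    demo_bitmap_add(&bmp, 'a');   /* in ASCII 'a' is 97: element 12, bit 1 */
    assert(demo_bitmap_match(&bmp, 'a'));
    assert(!demo_bitmap_match(&bmp, 'b'));
}
```

Both adding and testing a character take constant time, and the whole set of one-byte 
characters fits in 32 bytes.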

More than one byte
Here each character is appended to the internal string bigchars (of type STRING*).

bitmap_make_cstr
This is the same thing as bitmap_make, except it is called with a const char* 
argument.  Because of this, it knows there will be no bigchars, so it is only 
concerned with byte-sized characters.

bitmap_add
This function takes a bitmap and a single character, and adds that character to the 
bitmap.  The code for adding the character is the same as in bitmap_make.

bitmap_match
This function takes a bitmap and a single character, and checks to see if that 
character is in the bitmap.  If the character is more than one byte, then the function 
searches the bigchars string linearly (one by one).  If it is a byte-sized character, 
then it checks the appropriate bitfield, as specified in bitmap_make.
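
The two lookup paths can be sketched as follows.  This is an illustration under 
assumptions, not the code from rx.c: sketch_bitmap and its functions are hypothetical 
names, and a small fixed-size array of unsigned int stands in for the bigchars STRING*.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_MAX_BIGCHARS 16   /* arbitrary limit chosen for this sketch */

/* Hypothetical bitmap with both storage areas. */
typedef struct {
    unsigned char bits[32];                       /* one bit per one-byte character */
    unsigned int  bigchars[SKETCH_MAX_BIGCHARS];  /* stand-in for the bigchars string */
    size_t        nbig;                           /* number of bigchars stored */
} sketch_bitmap;

static void sketch_bitmap_init(sketch_bitmap *bmp) {
    memset(bmp, 0, sizeof *bmp);
}

/* Add a character.  Returns 1 on success, 0 if the bigchars array is full. */
static int sketch_bitmap_add(sketch_bitmap *bmp, unsigned int ch) {
    if (ch > 255) {
        if (bmp->nbig == SKETCH_MAX_BIGCHARS)
            return 0;
        bmp->bigchars[bmp->nbig++] = ch;          /* append, as with bigchars */
        return 1;
    }
    bmp->bits[ch >> 3] |= (unsigned char)(1u << (ch & 7));
    return 1;
}

/* Return 1 if ch is in the bitmap, 0 otherwise. */
static int sketch_bitmap_match(const sketch_bitmap *bmp, unsigned int ch) {
    size_t i;
    if (ch > 255) {
        /* More than one byte: linear search, one entry at a time. */
        for (i = 0; i < bmp->nbig; i++)
            if (bmp->bigchars[i] == ch)
                return 1;
        return 0;
    }
    /* Byte-sized: check the appropriate bit. */
    return (bmp->bits[ch >> 3] & (1u << (ch & 7))) != 0;
}

/* Small usage example. */
static void sketch_bitmap_selfcheck(void) {
    sketch_bitmap bmp;
    sketch_bitmap_init(&bmp);
    assert(sketch_bitmap_add(&bmp, 'x'));
    assert(sketch_bitmap_add(&bmp, 0x263A));      /* a character above 255 */
    assert(sketch_bitmap_match(&bmp, 'x'));
    assert(sketch_bitmap_match(&bmp, 0x263A));
    assert(!sketch_bitmap_match(&bmp, 0x263B));
}
```

The byte-sized path costs the same no matter how many characters are in the bitmap, 
while the multi-byte path grows with the number of big characters stored.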

bitmap_destroy
This deallocates the memory for the bitmap.


rx.h

Here is the definition for rxinfo (all comments are mine)

typedef struct rxinfo {
    STRING *string;     //This is the string the regex tests to see if it matches or not
                        //(At least that is what I guess, I couldn't find that anywhere)
    INTVAL index;       //This is the current spot in string we are checking
    INTVAL startindex;  //This is where the regex started checking
    INTVAL success;     //This is just a flag to see if the regex matched or not

    rxflags flags;      //This is a set of flags to see what modifiers were used in the regex
    UINTVAL minlength;  //The minimum length string can be and still be able to match
    rxdirection whichway; //Is the regex going forwards or backwards?

    PMC *groupstart;    //Indexes for where each group starts
    PMC *groupend;      //Indexes for where each group ends
                        //Groups here are capturing groups, i.e. $1, $2, etc.

    opcode_t *substfunc; //I've got no idea what this does
                         //My first guess was that that is a pointer to the first
                         //opcode the regex uses, but then I got confused by the name
                         //'substfunc.'  So basically ... what is it used for?

    IntStack stack;     //Sets up an intstack for internal use (backtracking purposes)
} rxinfo;


rx.h also sets up a series of macros for setting/unsetting flags in each regex, 
advancing the regex one char (or a given number of chars), and finding the current 
index.  Here is the list of the macros; check the rx.h file for their definitions.

RX_dUNPACK(pmc)
RxCurChar(rx)
RxAdvance(rx)
RxAdvanceX(rx, x)
RxCaseInsensitive_on(rx)
RxCaseInsensitive_off(rx)
RxCaseInsensitive_test(rx)
RxSingleLine_on(rx)
RxSingleLine_off(rx)
RxSingleLine_test(rx)
RxMultiline_on(rx)
RxMultiline_off(rx)
RxMultiline_test(rx)
RxReverse_on(rx)
RxReverse_off(rx)
RxReverse_test(rx)
RxFlagOn(rx, flag)
RxFlagOff(rx, flag)
RxFlagTest(rx, flag)
RxFlagsOff(rx)
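
For readers unfamiliar with this style of macro, here is a sketch of how on/off/test 
and advance macros of this kind are commonly written.  It is not copied from rx.h: 
demo_rxinfo, the DEMO_FLAG_* values, and the Demo* macros are all hypothetical, and 
the real definitions may well differ.

```c
#include <assert.h>

/* Hypothetical flag values, one bit per modifier. */
typedef enum {
    DEMO_FLAG_NONE             = 0,
    DEMO_FLAG_CASE_INSENSITIVE = 1 << 0,
    DEMO_FLAG_SINGLE_LINE      = 1 << 1,
    DEMO_FLAG_MULTILINE        = 1 << 2,
    DEMO_FLAG_REVERSE          = 1 << 3
} demo_rxflags;

/* Hypothetical, heavily reduced stand-in for rxinfo. */
typedef struct {
    unsigned int flags;   /* which modifiers are active */
    long         index;   /* current position in the string */
} demo_rxinfo;

/* Generic flag macros: set a bit, clear a bit, test a bit, clear all bits. */
#define DemoFlagOn(rx, flag)   ((rx)->flags |= (unsigned int)(flag))
#define DemoFlagOff(rx, flag)  ((rx)->flags &= ~(unsigned int)(flag))
#define DemoFlagTest(rx, flag) (((rx)->flags & (unsigned int)(flag)) != 0)
#define DemoFlagsOff(rx)       ((rx)->flags = DEMO_FLAG_NONE)

/* A named wrapper is just the generic macro with the flag filled in. */
#define DemoCaseInsensitive_on(rx)   DemoFlagOn((rx), DEMO_FLAG_CASE_INSENSITIVE)
#define DemoCaseInsensitive_test(rx) DemoFlagTest((rx), DEMO_FLAG_CASE_INSENSITIVE)

/* Advance by x characters, or by one. */
#define DemoAdvanceX(rx, x) ((rx)->index += (x))
#define DemoAdvance(rx)     DemoAdvanceX((rx), 1)

/* Small usage example. */
static void demo_flags_selfcheck(void) {
    demo_rxinfo rx = { DEMO_FLAG_NONE, 0 };
    DemoCaseInsensitive_on(&rx);
    assert(DemoCaseInsensitive_test(&rx));
    assert(!DemoFlagTest(&rx, DEMO_FLAG_REVERSE));
    DemoAdvance(&rx);
    DemoAdvanceX(&rx, 3);
    assert(rx.index == 4);
    DemoFlagsOff(&rx);
    assert(!DemoCaseInsensitive_test(&rx));
}
```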
