Hello everyone,

I've only been involved with Parrot since last week, but I've been learning quickly from all the documentation. With the recent discussion about the lack of documentation, I thought I'd try to help out as best I could. I've attached a file as a candidate for rx.dev. Some parts may be wrong, and at some points I even ask questions, but there isn't much to rx.[ch], so overall it should be a decent rough draft. Below are my questions, copied right out of the attached document:
1) The rx_is_number_character function breaks the abstraction and uses the following expression to test its argument:

    if (ch >= '0' && ch <= '9')

It explains that it is "faster to do less-than/greater-than". My question is: doesn't this restrict the ability to add different character encodings and language support? What about languages that don't use Arabic numerals?

2) In the rxinfo struct:

    opcode_t *substfunc;

My first guess was that it is a pointer to the first opcode the regex uses, but then I got confused by the name 'substfunc'. So basically ... what is it used for?

Thanks!
Stephen Rawls
rx.c / rx.h

rx.c and rx.h set up functions to be used by the regular expression engine. They also define internal helper functions that add a layer of abstraction to the rx_is_x family of functions. Please also see rx.ops, rxstacks.c, and rxstacks.h.

rx_allocate_info

Allocates the memory for a regular expression object and initializes it.

rx_is_word_character
rx_is_number_character
rx_is_whitespace_character
rx_is_newline

These functions check whether the character passed as an argument is a word character, number character, whitespace character, or newline, respectively. They each use bitmaps to add a layer of abstraction. A bitmap in this case is simply a set of characters. Instead of manually scanning a string, these functions create a bitmap of allowable characters (using predefined constants, like RX_WORDCHARS) and call the function C<bitmap_match>, which checks whether the supplied character is in the bitmap.

NOTE: The rx_is_number_character function breaks the abstraction and uses the following expression to test its argument:

    if (ch >= '0' && ch <= '9')

It explains that it is "faster to do less-than/greater-than". My question is: doesn't this restrict the ability to add different character encodings and language support? What about languages that don't use Arabic numerals?

bitmap_make

This function makes a bitmap from its argument (of type STRING*). Let us examine two cases: a character that fits in one byte, and a character that needs more.

One byte

First of all, (255 >> 3) = 31. The code uses this for a little efficiency in storage/speed. An internal array is created with 32 elements (each byte-sized). If you take the input character and right-shift it by 3, you get a number between 0 and 31; exactly 8 of the values between 0 and 255 map to each number between 0 and 31. Each element in this array is then a bitfield, with a 1 or 0 in each bit to indicate whether a particular character is in the bitmap or not.
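As a rough illustration of the byte-sized layout (not the actual rx.c code), the 32-byte array can be sketched in a few lines of C. The names demo_bitmap, demo_bitmap_add, demo_bitmap_match and so on are hypothetical and chosen only for this example, and multi-byte characters (bigchars) are left out entirely. The bit-selection expression 1 << (ch & 7) used here is the one discussed in the next paragraph.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for the byte-sized half of a bitmap:
 * 32 bytes x 8 bits = 256 bits, one per possible byte value. */
typedef struct demo_bitmap {
    unsigned char bits[32];
} demo_bitmap;

/* Start with an empty set of characters. */
static void demo_bitmap_clear(demo_bitmap *bmp) {
    assert(bmp != NULL);
    memset(bmp->bits, 0, sizeof bmp->bits);
}

/* Add one byte-sized character: ch >> 3 picks the array element,
 * 1 << (ch & 7) picks the bit inside that element. */
static void demo_bitmap_add(demo_bitmap *bmp, unsigned char ch) {
    assert(bmp != NULL);
    bmp->bits[ch >> 3] |= (unsigned char)(1 << (ch & 7));
}

/* Return nonzero if ch was added to the bitmap earlier. */
static int demo_bitmap_match(const demo_bitmap *bmp, unsigned char ch) {
    assert(bmp != NULL);
    return (bmp->bits[ch >> 3] & (1 << (ch & 7))) != 0;
}

/* Build a bitmap from a plain C string, one byte at a time. */
static void demo_bitmap_make_cstr(demo_bitmap *bmp, const char *chars) {
    demo_bitmap_clear(bmp);
    for (; *chars != '\0'; chars++) {
        demo_bitmap_add(bmp, (unsigned char)*chars);
    }
}
```

With this layout, building a bitmap from "0123456789" and then asking whether '5' matches costs one shift, one mask, and one array lookup, regardless of how many characters the set contains.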
So, (ch >> 3) takes us to the right element in the array for ch, but how do we get to the right bit within that bitfield? The code is 1 << (ch & 7). This gives a distinct power of two for each of the 8 characters that map to that particular element of the array.

More than one byte

Here each character is appended to the internal string bigchars (of type STRING*).

bitmap_make_cstr

This is the same thing as bitmap_make, except it is called with a const char* argument. Because of this, it knows there will be no bigchars, so it is only concerned with byte-sized characters.

bitmap_add

This function takes a bitmap and a single character, and adds that character to the bitmap. The code for adding the character is the same as in bitmap_make.

bitmap_match

This function takes a bitmap and a single character, and checks to see if that character is in the bitmap. If the character is more than one byte, the function searches the bigchars string linearly (one by one). If it is a byte-sized character, then it checks the appropriate bit in the bitfield, as described under bitmap_make.

bitmap_destroy

This deallocates the memory for the bitmap.

rx.h

Here is the definition for rxinfo (all comments are mine):

    typedef struct rxinfo {
        STRING *string;        // This is the string the regex tests to see if it matches or not
                               // (At least that is what I guess, I couldn't find that anywhere)
        INTVAL index;          // This is the current spot in string we are checking
        INTVAL startindex;     // This is where the regex started checking
        INTVAL success;        // This is just a flag to see if the regex matched or not

        rxflags flags;         // This is a set of flags to see what modifiers were used in the regex
        UINTVAL minlength;     // The minimum length string can be and still be able to match
        rxdirection whichway;  // Is the regex going forwards or backwards?

        PMC *groupstart;       // Indexes for where each group starts
        PMC *groupend;         // Indexes for where each group ends
                               // Groups here are capturing groups, i.e. $1, $2, etc.

        opcode_t *substfunc;   // I've got no idea what this does
                               // My first guess was that it is a pointer to the first
                               // opcode the regex uses, but then I got confused by the name
                               // 'substfunc'. So basically ... what is it used for?

        IntStack stack;        // Sets up an intstack for internal use (backtracking purposes)
    } rxinfo;

rx.h also sets up a series of macros for setting/unsetting flags in each regex, advancing the regex one char (or a given number of chars), and finding the current index. Here is the list of the macros; check the rx.h file for their definitions.

    RX_dUNPACK(pmc)
    RxCurChar(rx)
    RxAdvance(rx)
    RxAdvanceX(rx, x)

    RxCaseInsensitive_on(rx)
    RxCaseInsensitive_off(rx)
    RxCaseInsensitive_test(rx)

    RxSingleLine_on(rx)
    RxSingleLine_off(rx)
    RxSingleLine_test(rx)

    RxMultiline_on(rx)
    RxMultiline_off(rx)
    RxMultiline_test(rx)

    RxReverse_on(rx)
    RxReverse_off(rx)
    RxReverse_test(rx)

    RxFlagOn(rx, flag)
    RxFlagOff(rx, flag)
    RxFlagTest(rx, flag)
    RxFlagsOff(rx)
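For readers unfamiliar with this style of macro, here is a minimal sketch of how a family of on/off/test flag macros is commonly written in C. It is not copied from rx.h: the Demo* macro names, the demo_rx struct, and the DEMO_FLAG_* bit values are all hypothetical and exist only for this example, so the real definitions may well differ.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag bits; the real values are whatever rx.h defines. */
enum {
    DEMO_FLAG_CASE_INSENSITIVE = 1 << 0,
    DEMO_FLAG_SINGLE_LINE      = 1 << 1,
    DEMO_FLAG_MULTILINE        = 1 << 2,
    DEMO_FLAG_REVERSE          = 1 << 3
};

/* Hypothetical stand-in for the flags member of rxinfo. */
typedef struct demo_rx {
    unsigned int flags;
} demo_rx;

/* Generic macros: set, clear, and test one flag, or clear them all. */
#define DemoFlagOn(rx, flag)   ((rx)->flags |= (unsigned int)(flag))
#define DemoFlagOff(rx, flag)  ((rx)->flags &= ~(unsigned int)(flag))
#define DemoFlagTest(rx, flag) (((rx)->flags & (unsigned int)(flag)) != 0)
#define DemoFlagsOff(rx)       ((rx)->flags = 0)

/* Per-flag convenience macros built on the generic ones. */
#define DemoCaseInsensitive_on(rx)   DemoFlagOn(rx, DEMO_FLAG_CASE_INSENSITIVE)
#define DemoCaseInsensitive_off(rx)  DemoFlagOff(rx, DEMO_FLAG_CASE_INSENSITIVE)
#define DemoCaseInsensitive_test(rx) DemoFlagTest(rx, DEMO_FLAG_CASE_INSENSITIVE)

/* Reset a demo regex object so that no modifiers are active. */
static void demo_rx_init(demo_rx *rx) {
    assert(rx != NULL);
    DemoFlagsOff(rx);
}
```

In this sketch, calling DemoCaseInsensitive_on(&rx) and then DemoCaseInsensitive_test(&rx) yields a nonzero result, while the other flag bits are left untouched.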