Actually, these are mostly questions about the string_str_index function. I have some questions about bufstart, strstart, bufused, strlen and encoding->characters.
I *think* that ->characters is a function which gets passed a pointer to the start of a buffer and the number of bytes in the buffer, and returns the number of characters in the buffer... right? Can we assert that for any string s, the following holds?

    s->strlen == s->encoding->characters( s->strstart, s->bufused )

If so, where would this be documented?

Also, although we're told at the top of string.c not to look at s->bufstart or s->buflen, I'd like to know if we are allowed to assume/assert that for all strings, the following is true:

    s->encoding->skip_forward( s->strstart, s->strlen ) == (char*)s->bufstart + s->bufused

If so, this makes finding the end of a string significantly easier.

In string_str_index_multibyte, the lastmatch variable is calculated as:

    const void* const lastmatch = str->encoding->skip_backward((char*)str->strstart + str->strlen, find->encoding->characters(find, find->strlen));

There seems to be quite a bit of confusion on this line about bytes and characters: it adds a character count (str->strlen) to a byte pointer, and it passes find itself, rather than find's buffer, to ->characters. The goal here seems to be to find a pointer to the last place where it would be possible to begin a match. What's with find and ->characters? Shouldn't find->strlen be sufficient, without all that other stuff around it?

Next... If these weren't multibyte strings, then this would be (str->strstart + str->strlen - find->strlen), right?
Or, translating that literally (and doing the subtraction first):

    const void* const lastmatch = str->encoding->skip_forward( str->strstart, str->strlen - find->strlen );

Or, if we can do that trick for finding the end of a string:

    const void* const lastmatch = str->encoding->skip_backward( (char*)str->bufstart + str->bufused, find->strlen );

Similarly, the lastfind variable should either be:

    const void* const lastfind = find->encoding->skip_forward( find->strstart, find->strlen );

Or:

    const void* const lastfind = (char*)find->bufstart + find->bufused;

In the string_str_index function, I see:

    if (!s || !string_length(s)) return -1;
    if (!s2 || !string_length(s2)) return -1;

I would think that the function should start:

    /* the empty string is a substring of *every* string. */
    if( !s2 || !s2->strlen ) return s ? MIN( s->strlen, start ) : 0;
    /* you can't find a big string inside a little one. */
    if( !s || s->strlen - start < s2->strlen ) return -1;

If we left this as it originally was... consider what happens if these are multibyte strings, and s2 is much larger than s. Our 'lastmatch' variable will be well before the beginning of the buffer, and *reaching* it, either through skip_forward or skip_backward, would mean walking through memory that wasn't our own.

Next Q: Where are the docs on which functions are range checked, and which functions will segfault if called with bad params? If string_index(s, 10000) is called on a short string, it'll segfault. This isn't mentioned in string_index's docs, AFAICS.

Why aren't there string_(chr_r?index|str_rindex) functions? ENOTUITS on the part of whoever's in charge of string.{c,ops}? These would be quite useful in making the rx_ ops faster. (Note that string_str_index is useful for /foo/ or for /^.*?foo/, but not as good for /.*foo/ ... for that, you'd want to use a string_str_rindex.) I'd offer some impls for these, but sadly, I've a case of ENOTUITS myself.
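To make the guard checks and the character-based lastmatch computation discussed above a bit more concrete, here is a rough, standalone sketch. It is not Parrot code: toy_str, toy_characters, toy_skip_forward, toy_str_index and friends are all hypothetical names, the "encoding" is hard-wired to well-formed UTF-8, and both strings are assumed to share that encoding. It also adds one check that isn't in the proposal above (start > s->strlen), on the assumption that the lengths are unsigned and the subtraction could otherwise wrap.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a string header (NOT the real Parrot struct):
 * bufused counts bytes, strlen counts characters. */
typedef struct {
    const char *strstart;
    size_t bufused;
    size_t strlen;
} toy_str;

/* Number of UTF-8 characters in the first `bytes` bytes at p
 * (assumes well-formed input; counts every non-continuation byte). */
static size_t toy_characters(const char *p, size_t bytes) {
    size_t n = 0;
    for (size_t i = 0; i < bytes; i++)
        if (((unsigned char)p[i] & 0xC0) != 0x80)
            n++;
    return n;
}

/* Advance n characters from p, never stepping past end. */
static const char *toy_skip_forward(const char *p, const char *end, size_t n) {
    while (n-- && p < end) {
        p++;
        while (p < end && ((unsigned char)*p & 0xC0) == 0x80)
            p++;
    }
    return p;
}

/* Build a toy_str from a NUL-terminated UTF-8 literal. */
static toy_str toy_make(const char *utf8) {
    toy_str s = { utf8, strlen(utf8), 0 };
    s.strlen = toy_characters(utf8, s.bufused);
    return s;
}

/* The two invariants asked about above, expressed as a checkable predicate. */
static int toy_invariants_hold(const toy_str *s) {
    const char *end = s->strstart + s->bufused;
    return s->strlen == toy_characters(s->strstart, s->bufused)
        && toy_skip_forward(s->strstart, end, s->strlen) == end;
}

/* Character index of the first match of find in s at or after start, or -1. */
static long toy_str_index(const toy_str *s, const toy_str *find, size_t start) {
    /* The empty string is a substring of every string. */
    if (!find || find->strlen == 0) {
        if (!s) return 0;
        return (long)(start < s->strlen ? start : s->strlen);
    }
    /* You can't find a big string inside a little one. Checking start first
     * keeps the unsigned subtraction from wrapping (an added assumption). */
    if (!s || start > s->strlen || s->strlen - start < find->strlen)
        return -1;

    const char *end = s->strstart + s->bufused;
    const char *p = toy_skip_forward(s->strstart, end, start);
    /* Last place a match could begin, counted in characters, not bytes. */
    const char *lastmatch =
        toy_skip_forward(p, end, s->strlen - start - find->strlen);

    for (size_t i = start; ; i++) {
        /* Only compare when enough bytes remain, so we never read past end. */
        if ((size_t)(end - p) >= find->bufused
            && memcmp(p, find->strstart, find->bufused) == 0)
            return (long)i;
        if (p == lastmatch)
            break;
        p = toy_skip_forward(p, end, 1);
    }
    return -1;
}
```

With the guards up front, lastmatch is only computed once we know it lies inside the buffer, so neither the skip nor the compare can wander into memory that isn't ours. A string_str_rindex along the same lines would presumably start at lastmatch and walk backward instead.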
More possible additions to the string_ repertoire might be string_(starts|ends)with, to speed up /^foo/ and /bar$/ matches. Or a "limit" argument to string_whatever_r?index, for things like /^.{0,5}foo/.

-- 
$a=24;split//,240513;s/\B/ => /for@@=qw(ac ab bc ba cb ca );{push(@b,$a),($a-=6)^=1 for 2..$a/6x--$|;print "[EMAIL PROTECTED] ]\n";((6<=($a-=6))?$a+=$_[$a%6]-$a%6:($a=pop @b))&&redo;}