string.c questions

Benjamin Goldberg Sat, 02 Aug 2003 19:03:10 -0700

Actually, these are mostly questions about the string_str_index
function.

I have some questions about bufstart, strstart, bufused, strlen and
encoding->characters.


I *think* that ->characters is a function which gets passed a pointer to
the start of a buffer, and the number of bytes in the buffer, and
returns the number of characters in the buffer... right?

Can we assert that for any string s, s->strlen ==
s->encoding->characters( s->strstart, s->bufused)?  If so, where would
this be documented?
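
To make the question concrete, here is a minimal sketch of that
assertion. The demo_string and demo_encoding types and the UTF-8
counter are hypothetical stand-ins written for this mail, not the real
Parrot definitions, and I'm assuming ->characters has the (pointer,
byte count) signature guessed at above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the string and encoding structs; the real
 * definitions live in Parrot's headers and may differ. */
typedef struct demo_encoding {
    /* number of characters in the first `bytes` bytes at `ptr` */
    size_t (*characters)(const void *ptr, size_t bytes);
} demo_encoding;

typedef struct demo_string {
    const void *strstart;          /* first byte of the string data */
    size_t bufused;                /* bytes in use */
    size_t strlen;                 /* length in characters */
    const demo_encoding *encoding;
} demo_string;

/* UTF-8: count every byte that is not a continuation byte (10xxxxxx). */
static size_t utf8_characters(const void *ptr, size_t bytes)
{
    const unsigned char *p = ptr;
    size_t n = 0, i;
    for (i = 0; i < bytes; i++)
        if ((p[i] & 0xC0) != 0x80)
            n++;
    return n;
}

static const demo_encoding demo_utf8 = { utf8_characters };

/* The invariant asked about: strlen agrees with characters(). */
static int demo_length_is_consistent(const demo_string *s)
{
    return s->strlen == s->encoding->characters(s->strstart, s->bufused);
}
```

If the invariant does hold, a check like demo_length_is_consistent
could sit in a debug-only assert at the top of each string function.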

Also, although we're told at the top of string.c to not look at
s->bufstart or s->buflen, I'd like to know if we are allowed to
assume/assert that for all strings, the following is true:

   s->encoding->skip_forward( s->strstart, s->strlen ) ==
      (char*)s->bufstart + s->bufused

If so, this makes finding the end of a string significantly easier.
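
A sketch of what that second assertion would check, again with a
hypothetical UTF-8 skip_forward rather than the real encoding code. It
assumes well-formed input and uses one start pointer for both sides of
the comparison, which sidesteps the question of whether strstart and
bufstart can ever differ.

```c
#include <assert.h>
#include <stddef.h>

/* Byte length of a UTF-8 character, judged from its lead byte.
 * Assumes well-formed input. */
static size_t utf8_char_bytes(unsigned char lead)
{
    if (lead < 0x80) return 1;
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    return 4;
}

/* Hypothetical skip_forward: advance `n` characters from `ptr`.
 * The caller must guarantee that `n` characters really exist. */
static const void *utf8_skip_forward(const void *ptr, size_t n)
{
    const unsigned char *p = ptr;
    while (n-- > 0)
        p += utf8_char_bytes(*p);
    return p;
}

/* Does skipping `nchars` characters land exactly `bufused` bytes in? */
static int demo_end_is_consistent(const void *start, size_t bufused,
                                  size_t nchars)
{
    return utf8_skip_forward(start, nchars)
        == (const void *)((const char *)start + bufused);
}
```

The point of the shortcut is that the right-hand side is plain pointer
arithmetic, while the left-hand side has to walk the whole string.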

In string_str_index_multibyte, the lastmatch variable is calculated as:

    const void* const lastmatch =
       str->encoding->skip_backward((char*)str->strstart + str->strlen,
          find->encoding->characters(find, find->strlen));

There seems to be quite a bit of confusion on this line about bytes and
characters... the goal here seems to be to find a pointer to the last
place where it would be possible to begin a match.

What's with find and ->characters?  Shouldn't find->strlen be
sufficient, without all that other stuff around it?  Next...

If these weren't multibyte strings, then this would be (str->strstart +
str->strlen - find->strlen), right?  Or, translating that literally (and
doing the subtraction first):

    const void* const lastmatch = str->encoding->skip_forward(
       str->strstart, str->strlen - find->strlen );

Or, if we can do that trick for finding the end of a string:

    const void* const lastmatch = str->encoding->skip_backward(
       (char*)str->bufstart + str->bufused, find->strlen );
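
To convince myself that the two forms of lastmatch agree, here is a
small self-contained sketch. The UTF-8 skip functions are hypothetical
stand-ins for the encoding's own, they assume well-formed input, and
both helpers assume the caller has already checked that find is no
longer than str.

```c
#include <assert.h>
#include <stddef.h>

/* Byte length of a UTF-8 character, judged from its lead byte. */
static size_t utf8_char_bytes(unsigned char lead)
{
    if (lead < 0x80) return 1;
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    return 4;
}

/* Advance `n` characters from `ptr`. */
static const void *utf8_skip_forward(const void *ptr, size_t n)
{
    const unsigned char *p = ptr;
    while (n-- > 0)
        p += utf8_char_bytes(*p);
    return p;
}

/* Step back `n` characters from `ptr`: move back one byte, then keep
 * going while we are sitting on a continuation byte (10xxxxxx). */
static const void *utf8_skip_backward(const void *ptr, size_t n)
{
    const unsigned char *p = ptr;
    while (n-- > 0) {
        p--;
        while ((*p & 0xC0) == 0x80)
            p--;
    }
    return p;
}

/* Form 1: walk forward (str_chars - find_chars) from the start.
 * Requires find_chars <= str_chars. */
static const void *demo_lastmatch_forward(const void *strstart,
                                          size_t str_chars,
                                          size_t find_chars)
{
    return utf8_skip_forward(strstart, str_chars - find_chars);
}

/* Form 2: walk backward find_chars from the end of the string.
 * Requires at least find_chars characters before strend. */
static const void *demo_lastmatch_backward(const void *strend,
                                           size_t find_chars)
{
    return utf8_skip_backward(strend, find_chars);
}
```

When find is short, the backward form touches far fewer bytes, which
is the attraction of being able to locate the end of the string
cheaply.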

Similarly, the lastfind variable should either be:

    const void* const lastfind = find->encoding->skip_forward(
       find->strstart, find->strlen );

Or:

    const void* const lastfind = (char*)find->bufstart + find->bufused;

In the string_str_index function, I see:

    if (!s || !string_length(s))
        return -1;
    if (!s2 || !string_length(s2))
        return -1;

I would think that the function should start:

   /* the empty string is a substring of *every* string. */
   if( !s2 || !s2->strlen )
      return s ? MIN( s->strlen, start ) : 0;
   /* you can't find a big string inside a little one. */
   if( !s || s->strlen - start < s2->strlen )
      return -1;

If we left this as it originally was... consider what happens if these
are multibyte strings, and s2 is much larger than s... our 'lastmatch'
variable will be well before the beginning of the buffer... and
*reaching* it, either through skip_forward or skip_backward, would mean
walking through memory that wasn't our own.
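
Here is a byte-string sketch of a search with those two guards in
place. demo_str_index and its explicit length parameters are made up
for illustration; the real function takes string structs and has to
deal with encodings. I've also reordered the size check so that the
unsigned subtraction can't wrap when start is past the end of s, which
the version above doesn't guard against if the lengths are unsigned.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical byte-string search; lengths are passed explicitly. */
static long demo_str_index(const char *s, size_t slen,
                           const char *s2, size_t s2len, size_t start)
{
    size_t i;

    /* The empty string is a substring of every string. */
    if (s2 == NULL || s2len == 0)
        return s ? (long)(slen < start ? slen : start) : 0;

    /* You can't find a big string inside a little one.  start > slen
     * is tested first so that slen - start cannot wrap around. */
    if (s == NULL || start > slen || s2len > slen - start)
        return -1;

    /* slen - s2len is the last offset where a match could begin. */
    for (i = start; i <= slen - s2len; i++)
        if (memcmp(s + i, s2, s2len) == 0)
            return (long)i;
    return -1;
}
```

With the guards up front, the loop bound slen - s2len is never
computed for a needle longer than the haystack, so nothing walks
outside the buffer.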

Next Q: Where are the docs on which functions are range checked, and
which functions will segfault if called with bad params?  If
string_index(s, 10000) is called on a short string, it'll segfault.
This isn't mentioned in string_index's docs, afaiks.

Why aren't there string_(chr_r?index|str_rindex) functions?  ENOTUITS on
the part of whoever's in charge of string.{c,ops}?  These would be quite
useful in making the rx_ ops faster.  (Note that string_str_index is
useful for /foo/ or for /^.*?foo/, but not as good for /.*foo/ ... for
that, you'd want to use a string_str_rindex).

I'd offer some impls for these, but sadly, I've a case of ENOTUITS
myself.

More possible additions to the string_ repertoire might be
string_(starts|ends)with, to speed up /^foo/ and /bar$/ matches.  Or a
"limit" argument to string_whatever_r?index, for things like
/^.{0,5}foo/.
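
For the sake of discussion, a rough byte-oriented sketch of what such
helpers might look like. The names and signatures are hypothetical,
nothing here is encoding-aware, and a real version would take string
structs and go through the encoding's skip functions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Does s begin with prefix?  Would serve /^foo/. */
static int demo_starts_with(const char *s, size_t slen,
                            const char *prefix, size_t plen)
{
    return plen <= slen && memcmp(s, prefix, plen) == 0;
}

/* Does s end with suffix?  Would serve /bar$/. */
static int demo_ends_with(const char *s, size_t slen,
                          const char *suffix, size_t xlen)
{
    return xlen <= slen && memcmp(s + (slen - xlen), suffix, xlen) == 0;
}

/* Offset of the last occurrence of s2 in s, or -1.  Would serve
 * /.*foo/.  Candidate offsets are tried from the end downwards. */
static long demo_str_rindex(const char *s, size_t slen,
                            const char *s2, size_t s2len)
{
    size_t i;
    if (s2len > slen)
        return -1;
    for (i = slen - s2len + 1; i-- > 0; )
        if (memcmp(s + i, s2, s2len) == 0)
            return (long)i;
    return -1;
}
```

A "limit" argument would amount to capping the range of offsets that
the loop in demo_str_rindex (or its forward counterpart) is allowed to
try.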

-- 
$a=24;split//,240513;s/\B/ => /for@@=qw(ac ab bc ba cb ca
);{push(@b,$a),($a-=6)^=1 for 2..$a/6x--$|;print "[EMAIL PROTECTED]
]\n";((6<=($a-=6))?$a+=$_[$a%6]-$a%6:($a=pop @b))&&redo;}
