Re: string.c questions

Luke Palmer Sat, 02 Aug 2003 19:30:45 -0700

Benjamin Golberg writes:
> Actually, these are mostly questions about the string_str_index
> function.


Uh oh...

> I've some questions about bufstart, strstart, bufused, strlen and
> encoding->characters?
> 
> In string_str_index_multibyte, the lastmatch variable is calculated as:
> 
>     const void* const lastmatch =
>        str->encoding->skip_backward((char*)str->strstart + str->strlen,
>           find->encoding->characters(find, find->strlen));
> 
> There seems to be quite a bit of confusion on this line about bytes and
> characters... the goal here seems to be to find a pointer to the last
> place where it would be possible to begin a match.

Yep.

You're right, there is a bit of confusion about characters and bytes
in this statement -- mostly because I'm confused about characters and
bytes in Parrot.  So... str->strlen is the number of *characters* in
the string?  Hmm.. that changes things.

Maybe someone else should fix this -- who knows what they're doing :-)
Do we have tests for multibyte string operations in the test suite?

> What's with find and ->characters?  Shouldn't find->strlen be
> sufficient, without all that other stuff around it?  Next...

If find->strlen represents the number of characters as you say, then
yes.

> If these weren't multibyte strings, then this would be (str->strstart +
> str->strlen - find->strlen), right?  Or, translating that literally (and
> doing the subtraction first):
> 
>     const void* const lastmatch = str->encoding->skip_forward(
>        str->strstart, str->strlen - find->strlen );

Yeah, the thing about that is, for strings in UTF formats,
skip_forward is a linear time operation, which is pretty expensive
when there's a lot of data.  That's why I used pointers in this
function instead of string_index as the previous implementation did.

> Or, if we can do that trick for finding the end of a string:
> 
>     const void* const lastmatch = str->encoding->skip_backward(
>        (char*)str->bufstart + str->bufused, find->strlen );
> 
> Similarly, the lastfind variable should either be:
> 
>     const void* const lastfind = find->encoding->skip_forward(
>        find->strlen );

skip_forward takes 2 args, I assume you mean:

    const void* const lastfind = find->encoding->skip_forward(
        find, find->strlen);

Again, that's linear time.  But usually the string to find won't be
that long, so it's not so important in this case.  But your shortcut
would still be faster.

> Or:
> 
>     const void* const lastfind = (char*)find->bufstart + find->bufused;
> 
> In the string_str_index function, I see:
> 
>     if (!s || !string_length(s))
>         return -1;
>     if (!s2 || !string_length(s2))
>         return -1;
> 
> I would think that the function should start:
> 
>    /* the empty string is a substring of *every* string. */

I think the semantics described were to return a non-match if find was
empty.  But, mathematically consistent results are generally more
useful, so it's probably okay to change.

>    if( !s2 || !s2->strlen )
>       return s ? MIN( s->strlen, start ) : 0;
>    /* you can't find a big string inside a little one. */
>    if( !s || s->strlen - start < s2->strlen )
>       return -1;

I could have sworn I included those checks...

> If we left this as it originally was... consider what happens if these
> are multibyte strings, and s2 is much larger than s... our 'lastmatch'
> variable will be well before the beginning of the buffer... and
> *reaching* it, either through skip_forward or skip_backward, would mean
> walking through memory that wasn't our own.

Luke

Re: string.c questions

Reply via email to