Benjamin Golberg writes: > Actually, these are mostly questions about the string_str_index > function.
Uh oh... > I've some questions about bufstart, strstart, bufused, strlen and > encoding->characters? > > In string_str_index_multibyte, the lastmatch variable is calculated as: > > const void* const lastmatch = > str->encoding->skip_backward((char*)str->strstart + str->strlen, > find->encoding->characters(find, find->strlen)); > > There seems to be quite a bit of confusion on this line about bytes and > characters... the goal here seems to be to find a pointer to the last > place where it would be possible to begin a match. Yep. You're right, there is a bit of confusion about characters and bytes in this statement -- mostly because I'm confused about characters and bytes in Parrot. So... str->strlen is the number of *characters* in the string? Hmm.. that changes things. Maybe someone else should fix this -- who knows what they're doing :-) Do we have tests for multibyte string operations in the test suite? > What's with find and ->characters? Shouldn't find->strlen be > sufficient, without all that other stuff around it? Next... If find->strlen represents the number of characters as you say, then yes. > If these weren't multibyte strings, then this would be (str->strstart + > str->strlen - find->strlen), right? Or, translating that literally (and > doing the subtraction first): > > const void* const lastmatch = str->encoding->skip_forward( > str->strstart, str->strlen - find->strlen ); Yeah, the thing about that is, for strings in UTF formats, skip_forward is a linear time operation, which is pretty expensive when there's a lot of data. That's why I used pointers in this function instead of string_index as the previous implementation did. > Or, if we can do that trick for finding the end of a string: > > const void* const lastmatch = str->encoding->skip_backward( > (char*)str->bufstart + str->bufused, find->strlen ); > > Similarly, the lastfind variable should either be: > > const void* const lastfind = find->encoding->skip_forward( > find->strlen ); skip_forward takes 2 args, I assume you mean: const void* const lastfind = find->encoding->skip_forward( find, find->strlen); Again, that's linear time. But usually the string to find won't be that long, so it's not so important in this case. But your shortcut would still be faster. > Or: > > const void* const lastfind = (char*)find->bufstart + find->bufused; > > In the string_str_index function, I see: > > if (!s || !string_length(s)) > return -1; > if (!s2 || !string_length(s2)) > return -1; > > I would think that the function should start: > > /* the empty string is a substring of *every* string. */ I think the semantics described were to return a non-match if find was empty. But, mathematically consistent results are generally more useful, so it's probably okay to change. > if( !s2 || !s2->strlen ) > return s ? MIN( s->strlen, start ) : 0; > /* you can't find a big string inside a little one. */ > if( !s || s->strlen - start < s2->strlen ) > return -1; I could have sworn I included those checks... > If we left this as it originally was... consider what happens if these > are multibyte strings, and s2 is much larger than s... our 'lastmatch' > variable will be well before the beginning of the buffer... and > *reaching* it, either through skip_forward or skip_backward, would mean > walking through memory that wasn't our own. Luke