I figured that the slowdown was due to UTF8: for each char, offset() has to test whether it is a compound character. So I tried UTF16, figuring that it would then simply compare at the byte level.
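To make the byte-level idea concrete, here is a minimal sketch of how a byte position in UTF16 data maps back to a character position. It assumes that every character occupies exactly two bytes (no surrogate pairs) and that offset() reports byte positions when given the output of textEncode(). The handler name demoUtf16Offset is hypothetical and only for illustration.

```livecode
-- Hypothetical demo handler, not part of the original post.
-- Assumes 2 bytes per character (no surrogate pairs).
on demoUtf16Offset
   local tText, tNeedle, tBytePos
   -- numToCodepoint(240) is the Icelandic eth, a non-ASCII character
   put numToCodepoint(240) into tNeedle
   put "abc," & tNeedle & ",abc" into tText
   -- encode both strings so that offset() compares raw bytes
   put textEncode(tText, "UTF16") into tText
   put textEncode(tNeedle, "UTF16") into tNeedle
   put offset(tNeedle, tText) into tBytePos
   -- character n starts at byte 2n - 1, so an odd byte position b
   -- maps back to character (b + 1) / 2, written here with div and mod
   put tBytePos div 2 + tBytePos mod 2
end demoUtf16Offset
```

With two bytes per character, the fifth character starts at byte 9, and 9 div 2 + 9 mod 2 gives 5. The same arithmetic appears in the allOffsets function quoted further down.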
As it turned out, it was indeed faster. Now, I don't understand Unicode, but as I understand it, for some languages/signs/characters you need UTF32 to display them correctly. I may be wrong on that. But if it is true, then using UTF32 in textEncode adds only a small amount of processing time.

The nice thing is that searching UTF16- and UTF32-encoded text also respects caseSensitive. ByteOffset() for UTF16 is probably always case-sensitive, but only saves a small amount of processing time.

Also, LC apparently has to turn ASCII into UTF8 as soon as there is one non-ASCII character in the source text. In my naive understanding, LC could internally switch to UTF16/32 for offset() as soon as it realizes that UTF8 is in the source. That would make this workaround obsolete.

This is just how I "think" it works; the explanation may be all wrong.

Kind regards
Bernd

On 10.11.2018 at 20:30, Geoff Canyon <gcan...@gmail.com> wrote:

This is faster -- under some circumstances, much faster! Any idea why textEncoding suddenly fixes everything?

On Sat, Nov 10, 2018 at 5:13 AM Niggemann, Bernd via use-livecode <use-livecode@lists.runrev.com> wrote:

This is a little late, but there was a discussion about the slowness of simple offset() when dealing with text that contains Unicode characters. Geoff Canyon and Brian Milby found a faster solution by setting the itemDelimiter to the search string. They even provided a way to find the position of substrings in the search string which the offset() command does by design.

Here I propose a variant of the offset() form that uses UTF16 to search, easily adaptable to UTF32 if necessary.

To test (as in Brian's testStack), add a Unicode character to the text to be searched, e.g. at the end. Any non-ASCII character will do to see the speed penalty of simple offset(). I used ð (Icelandic d), or use any Chinese character.
Kind regards
Bernd

-------------------------------------------
function allOffsets pDelim, pString, pCaseSensitive
   local tNewPos, tPos, tResult
   -- encode both strings as UTF16 so that offset() works on bytes
   put textEncode(pDelim,"UTF16") into pDelim
   put textEncode(pString,"UTF16") into pString
   set the caseSensitive to pCaseSensitive is true
   put 0 into tPos
   repeat forever
      -- search again, skipping the tPos bytes already covered
      put offset(pDelim, pString, tPos) into tNewPos
      if tNewPos = 0 then exit repeat
      add tNewPos to tPos
      -- convert the byte position to a char position (2 bytes per char);
      -- the trailing ,"" appends a comma after the number
      put tPos div 2 + tPos mod 2,"" after tResult
   end repeat
   if tResult is empty then return 0
   -- strip the trailing comma
   else return char 1 to -2 of tResult
end allOffsets
-----------------------------------------

_______________________________________________
use-livecode mailing list
use-livecode@lists.runrev.com
Please visit this url to subscribe, unsubscribe and manage your subscription preferences:
http://lists.runrev.com/mailman/listinfo/use-livecode