Unfortunately, I just discovered that your solution doesn't produce correct results. If I get the offsets of "aaaaaaaaaa" in "↘𠜎aa↘𠜎aaaaaaaaaaaaa↘𠜎aaaa",
my code (and Brian Milby's) will return: 7,8,9,10
Your code will return: 9,10,11,12

As I understand it, textEncode transforms Unicode text into binary data,
which speeds things up because LC is no longer dealing with
variable-byte-length characters, just the underlying (fixed-length) binary
data that makes them up. Hence the above discrepancy. At least I think so.
Maybe there's a way to fix it?

gc

On Sat, Nov 10, 2018 at 12:12 PM Niggemann, Bernd <bernd.niggem...@uni-wh.de>
wrote:

> I figured that the slowdown was due to UTF-8: for each char it has to
> test whether it is a multi-byte character. So I just tried UTF-16,
> figuring that it would then just compare at the byte level.
>
> As it turned out, it was indeed faster.
>
> Now, I don't understand Unicode, but as I understand it, for some
> languages/signs/characters you need UTF-32 to display them correctly. I
> may be wrong on that. But if it is true, then using UTF-32 in textEncode()
> only adds a small amount to the processing time.
>
> The nice thing is that the UTF-16 and UTF-32 encodings also support
> caseSensitivity. byteOffset() on UTF-16 data is probably always
> case-sensitive, but it only saves a small amount of processing time.
>
> Also, LC apparently has to turn ASCII into UTF-8 as soon as there is a
> single non-ASCII character in the source text. In my naive understanding,
> LC could internally switch to UTF-16/32 for offset() as soon as it
> realizes that UTF-8 is in the source. That would make this workaround
> obsolete.
>
> This is just how I "think" it works; the explanation may be all wrong.
>
> Kind regards
>
> Bernd
>
> On 10.11.2018 at 20:30, Geoff Canyon <gcan...@gmail.com> wrote:
>
> This is faster -- under some circumstances, much faster! Any idea why
> textEncode suddenly fixes everything?
>
> On Sat, Nov 10, 2018 at 5:13 AM Niggemann, Bernd via use-livecode <
> use-livecode@lists.runrev.com> wrote:
>
>> This is a little late, but there was a discussion about the slowness of
>> simple offset() when dealing with text that contains Unicode characters.
>>
>> Geoff Canyon and Brian Milby found a faster solution by setting the
>> itemDelimiter to the search string. They even provided a way to find the
>> position of substrings in the searched string, which the offset()
>> function does by design.
>>
>> Here I propose a variant of the offset() form that uses UTF-16 to
>> search, easily adaptable to UTF-32 if necessary.
>>
>> To test (as in Brian's test stack), add a Unicode character to the text
>> to be searched, e.g. at the end. Any non-ASCII character will show the
>> speed penalty of simple offset(). I used ð (Icelandic eth), or use any
>> Chinese character.
>>
>> Kind regards
>> Bernd
>>
>> -------------------------------------------
>> function allOffsets pDelim, pString, pCaseSensitive
>>    local tNewPos, tPos, tResult
>>
>>    put textEncode(pDelim,"UTF16") into pDelim
>>    put textEncode(pString,"UTF16") into pString
>>
>>    set the caseSensitive to pCaseSensitive is true
>>    put 0 into tPos
>>    repeat forever
>>       put offset(pDelim, pString, tPos) into tNewPos
>>       if tNewPos = 0 then exit repeat
>>       add tNewPos to tPos
>>       put tPos div 2 + tPos mod 2,"" after tResult
>>    end repeat
>>    if tResult is empty then return 0
>>    else return char 1 to -2 of tResult
>> end allOffsets
>> -----------------------------------------
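For illustration, this is roughly how allOffsets would be exercised on the
sample string from the top of the thread; the mouseUp wrapper is only an
assumed test harness, and the commented result is the one Geoff reports for
this handler:

-------------------------------------------
-- assumed test harness; allOffsets must be in the message path
on mouseUp
   -- Geoff's sample string, searched for ten "a"s
   answer allOffsets("aaaaaaaaaa", "↘𠜎aa↘𠜎aaaaaaaaaaaaa↘𠜎aaaa", false)
   -- shows "9,10,11,12"; character-based offsets would be 7,8,9,10
end mouseUp
-------------------------------------------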
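The reason for the drift, as far as it can be sketched here: 𠜎 lies outside
Unicode's Basic Multilingual Plane, so its UTF-16 encoding is a four-byte
surrogate pair rather than the two bytes that the "tPos div 2 + tPos mod 2"
conversion assumes for every character, and each such character ahead of a
match shifts the computed position by one. The handler below is a made-up
demo; it also assumes textEncode adds no byte-order mark, which matches the
offsets reported above.

-------------------------------------------
-- hypothetical demo handler (not from the thread)
on demoSurrogateWidth
   local tNarrow, tWide
   put the number of bytes in textEncode("a", "UTF16") into tNarrow -- 2: one code unit
   put the number of bytes in textEncode("𠜎", "UTF16") into tWide -- 4: a surrogate pair
   answer tNarrow && tWide -- "2 4"
end demoSurrogateWidth
-------------------------------------------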
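One possible repair, offered only as an untested sketch and not as anything
from the thread: keep the fast byte-level search on the UTF-16 data, but map
each byte position back to a character position by decoding the bytes that
precede the match and counting characters, so two-byte and four-byte
characters are both counted once. The name allOffsetsByChar is invented, and
re-decoding the prefix on every hit costs some speed in exchange for correct
positions.

-------------------------------------------
-- untested sketch: byte-level UTF-16 search, character-level positions
function allOffsetsByChar pDelim, pString, pCaseSensitive
   local tNewPos, tPos, tCharsBefore, tResult
   put textEncode(pDelim, "UTF16") into pDelim
   put textEncode(pString, "UTF16") into pString
   set the caseSensitive to pCaseSensitive is true
   put 0 into tPos
   repeat forever
      put offset(pDelim, pString, tPos) into tNewPos
      if tNewPos = 0 then exit repeat
      add tNewPos to tPos
      -- bytes 1 .. tPos-1 hold everything before the match; decoding them
      -- counts a surrogate pair as one character instead of two
      put the number of characters in textDecode(byte 1 to tPos - 1 of pString, "UTF16") into tCharsBefore
      put tCharsBefore + 1 & comma after tResult
   end repeat
   if tResult is empty then return 0
   else return char 1 to -2 of tResult
end allOffsetsByChar
-------------------------------------------

With Geoff's sample string this should return 7,8,9,10, matching the
character-based results.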