I just posted an updated stack with the UTF16 and UTF32 offset variants. I did change the search on the first card to “The” and the counts remained the same so case folding does work for ASCII values. I would need some other test text to check other cases where Unicode case folding would be expected to work. I’m not surprised that it works for ASCII values since UTF16/32 will just pad with null bytes which will not impact the case folding logic.
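One way to run the non-ASCII check described above is sketched here. It is only a sketch: the function name testUnicodeCaseFolding and the sample text are hypothetical and are not part of the stack. It compares a case-insensitive offset() on the text itself with the same search on UTF32-encoded data, using é (codepoint 233) as the needle and É (codepoint 201) in the haystack. No result is assumed here; the point is to see whether the two numbers agree.

```livecode
-- Hypothetical test helper, not part of the posted stack.
-- Returns "plainOffset,encodedOffset" for a case-insensitive search.
function testUnicodeCaseFolding
   local tNeedle, tHaystack, tPlain, tEncoded

   put numToCodepoint(233) into tNeedle            -- é
   put "caf" & numToCodepoint(201) into tHaystack  -- caf followed by É

   set the caseSensitive to false

   -- search the text directly
   put offset(tNeedle, tHaystack) into tPlain

   -- search the UTF32-encoded binary data (a byte position, not a char position)
   put offset(textEncode(tNeedle,"UTF32"), textEncode(tHaystack,"UTF32")) into tEncoded

   return tPlain & comma & tEncoded
end testUnicodeCaseFolding
```

If the second value comes back as 0 while the first does not, case folding is not being applied to the encoded data for that character.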
This afternoon I think I’ll add a way to turn case sensitivity on and off per card. I also want to script running the tests on a single card and on all cards.

Thanks,
Brian
On Nov 11, 2018, 1:51 AM -0600, Geoff Canyon via use-livecode <use-livecode@lists.runrev.com>, wrote:
> One thing I don't get is how (not) caseSensitive gets handled? Once the
> text is all binary data, is the engine really still able to look at the
> binary values for "A" and "a" and treat them as the same?
>
> On Sat, Nov 10, 2018 at 8:54 PM Brian Milby via use-livecode <
> use-livecode@lists.runrev.com> wrote:
>
> > The correct formula for UTF16 should be:
> > put tPos div 2 + 1,"" after tResult
> >
> > The correct formula for UTF32 should be:
> > put tPos div 4 + 1,"" after tResult
> >
> > If you go to card #6 of my stack that is on GitHub, it has the first
> > chapter of John that I copied from the internet. I added a single UTF(8?)
> > character to it to get the results that are listed (last char of the first
> > visible line).
> >
> > Making the change is dramatic. Offset takes 65k ms. Geoff's code and my
> > code take 6k ms. Converting to UTF16 first and using offset takes 519 ms.
> > And note that the timings include those 2 calls to convert the variables to
> > UTF16.
> >
> > Now, if I take Geoff's sample that didn't work, I get the same incorrect
> > results. The problem is that some of those characters end up needing 2
> > UTF16 code units to represent them. (I hope I'm using the correct
> > terminology.)
> >
> > So, the safe solution is to always use UTF32. Speed looks to be close
> > between them. On my largest text example, UTF16 took 448ms and UTF32 took
> > 584ms.
> >
> > I'll update my stack and push an update to my repo so others can check out
> > the new code.
> >
> > On Sat, Nov 10, 2018 at 6:05 PM Niggemann, Bernd <bernd.niggem...@uni-wh.de>
> > wrote:
> >
> > > That is what I alluded to,
> > > UTF is a wild country and I don't know my ways,
> > > try
> > > -----------------------------
> > > function allOffsets pDelim, pString, pCaseSensitive
> > >    local tNewPos, tPos, tResult
> > >
> > >    put textEncode(pDelim,"UTF32") into pDelim
> > >    put textEncode(pString,"UTF32") into pString
> > >
> > >    set the caseSensitive to pCaseSensitive is true
> > >    put 0 into tPos
> > >    repeat forever
> > >       put offset(pDelim, pString, tPos) into tNewPos
> > >       if tNewPos = 0 then exit repeat
> > >       add tNewPos to tPos
> > >       put tPos div 4 + tPos mod 4,"" after tResult
> > >    end repeat
> > >    if tResult is empty then return 0
> > >    else return char 1 to -2 of tResult
> > > end allOffsets
> > > ----------------------------------
> > >
> > > It teaches me to use UTF32 to be on the safe side, thank you.
> > > But that should take care of it.
> > >
> > > Kind regards
> > > Bernd
> > >
> > > On 11.11.2018 at 00:00, Geoff Canyon <gcan...@gmail.com> wrote:
> > >
> > > Unfortunately, I just discovered that your solution doesn't produce
> > > correct results. If I get the offsets of "aaaaaaaaaa" in
> > > "↘𠜎aa↘𠜎aaaaaaaaaaaaa↘𠜎aaaa",
> > >
> > > My code (and Brian Milby's) will return: 7,8,9,10
> > >
> > > Your code will return: 9,10,11,12
> > >
> > > As I understand it, textEncode transforms Unicode text into binary data,
> > > which has the effect of speeding things up because LC is no longer
> > > dealing with variable-byte-length characters, just the underlying
> > > (fixed-length) binary data that makes them up. Hence the above
> > > discrepancy. At least I think so. Maybe there's a way to fix it?
> > >
> > > gc
> > >
> > > On Sat, Nov 10, 2018 at 12:12 PM Niggemann, Bernd <
> > > bernd.niggem...@uni-wh.de> wrote:
> > >
> > > > I figured that the slowdown was due to UTF8: for each char it has to
> > > > test whether it is a compound character. So I just tried with UTF16,
> > > > figuring that now it just compares at the byte level.
> > > >
> > > > As it turned out it was indeed faster.
> > > >
> > > > Now I don't understand Unicode, but as I understand it, for some
> > > > languages/signs/characters you need UTF32 to display them correctly.
> > > > I may be wrong on that. But if it is true, then using UTF32 in
> > > > textEncode only adds a small amount to processing time.
> > > >
> > > > The nice thing is that UTF16 and UTF32 text encoding also support
> > > > caseSensitive. byteOffset() for UTF16 is probably always
> > > > case-sensitive, but only saves a small amount of processing time.
> > > >
> > > > Also, LC apparently has to turn ASCII into UTF8 as soon as there is one
> > > > non-ASCII character in the source text. In my naive understanding LC
> > > > could internally switch to UTF16/32 for offset() as soon as it realizes
> > > > that UTF8 is in the source. That would make this workaround obsolete.
> > > >
> > > >
> > > > This is just how I "think" it works, the explanation may be all wrong.
> > > >
> > > > Kind regards
> > > >
> > > > Bernd
> > > >
> > > > On 10.11.2018 at 20:30, Geoff Canyon <gcan...@gmail.com> wrote:
> > > >
> > > > This is faster -- under some circumstances, much faster! Any idea why
> > > > textEncoding suddenly fixes everything?
> > > >
> > > > On Sat, Nov 10, 2018 at 5:13 AM Niggemann, Bernd via use-livecode <
> > > > use-livecode@lists.runrev.com> wrote:
> > > >
> > > > > This is a little late but there was a discussion about the slowness
> > > > > of simple offset() when dealing with text that contains Unicode
> > > > > characters.
> > > > >
> > > > > Geoff Canyon and Brian Milby found a faster solution by setting the
> > > > > itemDelimiter to the search string.
> > > > > They even provided a way to find the position of substrings in the
> > > > > search string which the offset() function does by design.
> > > > >
> > > > > Here I propose a variant of the offset() form that uses UTF16 to
> > > > > search, easily adaptable to UTF32 if necessary.
> > > > >
> > > > > To test (as in Brian's testStack), add a Unicode character to the
> > > > > text to be searched, e.g. at the end. Any non-ASCII character will
> > > > > show the speed penalty of simple offset(). I used ð (Icelandic eth);
> > > > > any Chinese character works too.
> > > > >
> > > > >
> > > > > Kind regards
> > > > > Bernd
> > > > >
> > > > > -------------------------------------------
> > > > > function allOffsets pDelim, pString, pCaseSensitive
> > > > >    local tNewPos, tPos, tResult
> > > > >
> > > > >    put textEncode(pDelim,"UTF16") into pDelim
> > > > >    put textEncode(pString,"UTF16") into pString
> > > > >
> > > > >    set the caseSensitive to pCaseSensitive is true
> > > > >    put 0 into tPos
> > > > >    repeat forever
> > > > >       put offset(pDelim, pString, tPos) into tNewPos
> > > > >       if tNewPos = 0 then exit repeat
> > > > >       add tNewPos to tPos
> > > > >       put tPos div 2 + tPos mod 2,"" after tResult
> > > > >    end repeat
> > > > >    if tResult is empty then return 0
> > > > >    else return char 1 to -2 of tResult
> > > > > end allOffsets
> > > > > ----------------------------------------
> > > > >
> >
> > _______________________________________________
> > use-livecode mailing list
> > use-livecode@lists.runrev.com
> > Please visit this url to subscribe, unsubscribe and manage your
> > subscription preferences:
> > http://lists.runrev.com/mailman/listinfo/use-livecode
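
For reference, the pieces discussed in this thread can be combined roughly as follows. This is a minimal sketch rather than the code from the GitHub stack: the name allOffsetsUTF32 is hypothetical, and the alignment test (tPos mod 4 = 1) is an added assumption that was not part of the posted handlers. It takes the UTF32 variant, uses the corrected position formula (tPos div 4 + 1), and ignores any byte-level match that does not start on a 4-byte boundary.

```livecode
-- Sketch only: returns a comma-separated list of the codepoint positions
-- of pDelim in pString, or 0 if there is no match.
function allOffsetsUTF32 pDelim, pString, pCaseSensitive
   local tNewPos, tPos, tResult

   -- UTF32 stores every codepoint in exactly 4 bytes
   put textEncode(pDelim,"UTF32") into pDelim
   put textEncode(pString,"UTF32") into pString

   set the caseSensitive to pCaseSensitive is true
   put 0 into tPos
   repeat forever
      -- offset() returns a position relative to the tPos bytes skipped
      put offset(pDelim, pString, tPos) into tNewPos
      if tNewPos = 0 then exit repeat
      add tNewPos to tPos
      -- assumption: only keep matches that start on a codepoint boundary
      -- (byte positions 1, 5, 9, ...), then convert bytes to codepoints
      if tPos mod 4 = 1 then put tPos div 4 + 1,"" after tResult
   end repeat
   if tResult is empty then return 0
   else return char 1 to -2 of tResult
end allOffsetsUTF32
```

The positions returned are counted in codepoints. Because only tNewPos is added to tPos, the next search starts just after the start of the previous match, so overlapping matches are still reported.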