That is what I alluded to;
UTF is a wild country and I don't know my way around it.
Try this:
-----------------------------
function allOffsets pDelim, pString, pCaseSensitive
   local tNewPos, tPos, tResult

   -- encode both strings as UTF-32 so offset() scans the underlying binary data
   put textEncode(pDelim,"UTF32") into pDelim
   put textEncode(pString,"UTF32") into pString

   set the caseSensitive to pCaseSensitive is true
   put 0 into tPos
   repeat forever
      put offset(pDelim, pString, tPos) into tNewPos
      if tNewPos = 0 then exit repeat
      add tNewPos to tPos
      -- convert the byte offset back into a (4-byte) code-unit offset
      put tPos div 4 + tPos mod 4,"" after tResult
   end repeat
   if tResult is empty then return 0
   else return char 1 to -2 of tResult
end allOffsets
----------------------------------
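
For reference, a hypothetical call of the handler above (the field name "source" and the search strings are made up):

-----------------------------
on mouseUp
   local tAnyCase, tExactCase
   -- returns a comma-separated list of offsets, or 0 if nothing matches
   put allOffsets("aa", field "source", false) into tAnyCase    -- case-insensitive
   put allOffsets("AA", field "source", true) into tExactCase   -- case-sensitive
   answer tAnyCase & return & tExactCase
end mouseUp
-----------------------------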

It teaches me to use UTF32 to be on the safe side, thank you.
The version above should take care of it.

Kind regards
Bernd



On 11.11.2018 at 00:00, Geoff Canyon <gcan...@gmail.com> wrote:

Unfortunately, I just discovered that your solution doesn't produce correct 
results. If I get the offsets of "aaaaaaaaaa" in "↘𠜎aa↘𠜎aaaaaaaaaaaaa↘𠜎aaaa",

My code (and Brian Milby's) will return: 7,8,9,10

Your code will return: 9,10,11,12

As I understand it, textEncode transforms Unicode text into binary data, which 
speeds things up because LC is no longer dealing with variable-length characters, 
just the underlying (fixed-length) binary data that makes them up. Hence the 
above discrepancy. At least I think so. Maybe there's a way to fix it?

gc
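
A minimal sketch (not from the thread) of the kind of mismatch described above: "𠜎" is one LiveCode char but two UTF-16 code units, so any match that comes after it lands at a later position when counted in encoded code units than when counted in chars.

-----------------------------
on mouseUp
   local tText
   put "𠜎a" into tText
   answer "chars:" && the number of chars in tText \
         && "codeunits:" && the number of codeunits in tText \
         && "UTF16 bytes:" && the number of bytes in textEncode(tText, "UTF16")
   -- reports 2 chars, 3 code units, 6 bytes: positions derived from the
   -- encoded bytes shift for anything that follows the "𠜎"
end mouseUp
-----------------------------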

On Sat, Nov 10, 2018 at 12:12 PM Niggemann, Bernd <bernd.niggem...@uni-wh.de> wrote:
I figured that the slowdown was due to UTF8: for each char the engine has to test 
whether it is a multi-byte character. So I just tried UTF16, figuring that it 
would then simply compare at the byte level.

As it turned out, it was indeed faster.

Now, I don't really understand Unicode, but as I understand it, some 
languages/signs/characters need UTF32 to be represented correctly. I may be 
wrong on that. But if it is true, then the overhead of using UTF32 in textEncode 
only adds a small amount of processing time.
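
For what it's worth, a quick way to see the encoded widths side by side (a sketch, not part of the original post):

-----------------------------
-- bytes needed to encode "a", "ð" and "𠜎" together in each encoding
on mouseUp
   local tSample
   put "að𠜎" into tSample
   answer "UTF8:" && the number of bytes in textEncode(tSample, "UTF8") \
         && "UTF16:" && the number of bytes in textEncode(tSample, "UTF16") \
         && "UTF32:" && the number of bytes in textEncode(tSample, "UTF32")
   -- expect 7 / 8 / 12 bytes: UTF8 chars vary from 1 to 4 bytes,
   -- UTF16 from 2 to 4, UTF32 is always 4
end mouseUp
-----------------------------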

The nice thing is that UTF16 and UTF32 encoding also respect the caseSensitive 
property. byteOffset() on UTF16 data is probably always case-sensitive, and only 
saves a small amount of processing time anyway.

Also, LC apparently has to turn ASCII into UTF8 as soon as there is a single 
non-ASCII character in the source text. In my naive understanding, LC could 
internally switch to UTF16/32 for offset() as soon as it realizes that UTF8 is 
in the source. That would make this workaround obsolete.


This is just how I "think" it works; the explanation may be all wrong.

Kind regards

Bernd

On 10.11.2018 at 20:30, Geoff Canyon <gcan...@gmail.com> wrote:

This is faster -- under some circumstances, much faster! Any idea why 
textEncode suddenly fixes everything?

On Sat, Nov 10, 2018 at 5:13 AM Niggemann, Bernd via use-livecode 
<use-livecode@lists.runrev.com> wrote:
This is a little late, but there was a discussion about the slowness of simple 
offset() when dealing with text that contains Unicode characters.

Geoff Canyon and Brian Milby found a faster solution by setting the 
itemDelimiter to the search string.
They even provided a way to report the positions of the matches within the 
searched text, which offset() does by design.
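
For readers who missed that earlier thread, here is a rough sketch of the general idea; this is not Geoff's or Brian's actual code, and it ignores edge cases such as overlapping matches, case sensitivity, or a string that ends with the delimiter:

-----------------------------
function itemDelimiterOffsets pDelim, pString
   local tDelimLength, tItemCount, tPos, tResult
   set the itemDelimiter to pDelim
   put the number of chars of pDelim into tDelimLength
   put the number of items of pString into tItemCount
   put 0 into tPos
   -- each item boundary marks one occurrence of pDelim
   repeat with i = 1 to tItemCount - 1
      add the number of chars of item i of pString to tPos
      add 1 to tPos                        -- first char of this occurrence
      put tPos,"" after tResult
      add tDelimLength - 1 to tPos         -- skip over the rest of the occurrence
   end repeat
   if tResult is empty then return 0
   return char 1 to -2 of tResult
end itemDelimiterOffsets
-----------------------------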

Here I propose a variant of the offset() approach that searches on 
UTF16-encoded data, easily adaptable to UTF32 if necessary.

To test (as in Brian's testStack), add a Unicode character to the text to be 
searched, e.g. at the end. Any non-ASCII character will show the speed penalty 
of simple offset(). I used ð (Icelandic eth), but any Chinese character works as well.


Kind regards
Bernd

-------------------------------------------
function allOffsets pDelim, pString, pCaseSensitive
   local tNewPos, tPos, tResult

   -- encode both strings as UTF-16 so offset() scans the underlying binary data
   put textEncode(pDelim,"UTF16") into pDelim
   put textEncode(pString,"UTF16") into pString

   set the caseSensitive to pCaseSensitive is true
   put 0 into tPos
   repeat forever
      put offset(pDelim, pString, tPos) into tNewPos
      if tNewPos = 0 then exit repeat
      add tNewPos to tPos
      -- convert the byte offset back into a (2-byte) code-unit offset
      put tPos div 2 + tPos mod 2,"" after tResult
   end repeat
   if tResult is empty then return 0
   else return char 1 to -2 of tResult
end allOffsets
----------------------------------------
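
As a rough illustration of the test described above, something along these lines could be used to time the handler (the field name, search string, and the ð suffix are only placeholders):

-----------------------------
on mouseUp
   local tText, tStart, tElapsed, tOffsets
   put field "source" into tText
   put "ð" after tText   -- a single non-ASCII char slows down plain offset()
   put the milliseconds into tStart
   put allOffsets("aa", tText, false) into tOffsets
   put the milliseconds - tStart into tElapsed
   answer "allOffsets took" && tElapsed && "ms; offsets:" && tOffsets
end mouseUp
-----------------------------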

_______________________________________________
use-livecode mailing list
use-livecode@lists.runrev.com
Please visit this url to subscribe, unsubscribe and manage your subscription 
preferences:
http://lists.runrev.com/mailman/listinfo/use-livecode
