Re: How to find offsets in Unicode Text fast

Geoff Canyon via use-livecode Mon, 12 Nov 2018 21:20:02 -0800

A few things:

1. It seems codepointOffset can only find a single character, so it
won't work when searching for a multi-character string?
2. codepointOffset seems to work differently for multi-byte characters and
regular characters:


put codepointoffset("e","↘ndatestest",6) -- puts 3
put codepointoffset("e","andatestest",6) -- puts 9

3. It seems that when multi-byte characters are involved, codepointOffset
suffers from the same sort of slow-down as offset does. For example, in a
145K string with about 20K hits for a single character, a simple
codepointOffset routine (below) takes over 10 seconds, while the item-based
routine takes about 0.3 seconds for the same results.
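
For readers unfamiliar with the item-based idea, here is a minimal sketch
of it, written in Python rather than LiveCode so it can be run anywhere. It
is not the routine from this thread: the function name all_offsets and its
parameters are made up for illustration. It assumes 1-based offsets,
non-overlapping matches, and positions counted in codepoints (which is how
Python indexes strings), not graphemes. The point is that the string is
split on the search text once and the piece lengths are accumulated, so
nothing is rescanned from the start for each hit.

```python
def all_offsets(needle, haystack):
    """Return the 1-based codepoint offsets of every non-overlapping
    occurrence of needle in haystack (hypothetical helper, for illustration)."""
    if not needle:
        raise ValueError("needle must not be empty")
    offsets = []
    position = 1  # 1-based position where the next piece starts
    pieces = haystack.split(needle)
    # Every piece except the last one is followed by one occurrence of needle.
    for piece in pieces[:-1]:
        position += len(piece)    # skip the text before the match
        offsets.append(position)  # the match starts here
        position += len(needle)   # skip over the match itself
    return offsets


print(all_offsets("e", "andatestest"))
print(all_offsets("e", "↘ndatestest"))
```

Both calls print [6, 9], because in this sketch "↘" counts as a single
codepoint just like "a" does.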

On Mon, Nov 12, 2018 at 4:21 PM Monte Goulding via use-livecode <
use-livecode@lists.runrev.com> wrote:

> Hi Folks
>
> I was a bit perplexed by this, so I had a quick look around the engine and I
> see the issue. The problem is that you are using `offset`, which works on
> characters. Characters in LiveCode are neither Unicode codepoints nor bytes.
> They are graphemes. This means that when you have chars to skip, the entire
> string needs to be parsed to find the grapheme boundaries so that the index
> can be translated into graphemes to skip. Note that if the strings you were
> dealing with weren’t Unicode, then the translation of chars to graphemes is
> 1 -> 1, so there’s no big cost, which is why things are much faster when you
> textEncode and offset that.
>
> So! Change to using codepointOffset and hopefully it will be much speedier!
>
> Cheers
>
> Monte
> _______________________________________________
> use-livecode mailing list
> use-livecode@lists.runrev.com
> Please visit this url to subscribe, unsubscribe and manage your
> subscription preferences:
> http://lists.runrev.com/mailman/listinfo/use-livecode