Re: How to find offsets in Unicode Text fast

2018-11-13 Thread Geoff Canyon via use-livecode
I didn't realize this conversation was just between Bernd and me, so here it is for the list. Bernd found a solution for the Reykjavík issue (seemingly -- it works, but it's weird) and based on a conversation in another thread I have a solution for non-case-sensitive matching. So the UTF-32 version

Re: How to find offsets in Unicode Text fast

2018-11-13 Thread Mark Waddingham via use-livecode
On 2018-11-13 11:37, Geoff Canyon via use-livecode wrote: I understand (generally) the complexity of comparison, but that's not the speed issue causing this discussion. Most of the proposed solutions are using nearly the same operators/functions for comparison, or at least the same comparison

Re: How to find offsets in Unicode Text fast

2018-11-13 Thread Geoff Canyon via use-livecode
A lot of useful points here, thanks. The caseSensitive vs. binary has been covered in the other discussion -- Monte said that by using offset() "the engine will convert your data to a string and assume native encoding. This is probably why you are getting some case insensitivity." I understand (

Re: How to find offsets in Unicode Text fast

2018-11-13 Thread Mark Waddingham via use-livecode
On 2018-11-13 01:06, Geoff Canyon via use-livecode wrote: On Mon, Nov 12, 2018 at 11:36 AM Ben Rubinstein via use-livecode < use-livecode@lists.runrev.com> wrote: I'm really confused that case-insensitive should work at all for UTF-16 or UTF-32; The caseSensitive (and formSensitive) properti

Re: How to find offsets in Unicode Text fast

2018-11-12 Thread Mark Waddingham via use-livecode
On 2018-11-13 06:35, Geoff Canyon via use-livecode wrote: I didn't realize until now that offset() simply fails with some unicode strings: put offset("a","↘𠜎qeiuruioqeaaa↘𠜎qeiuar",13) -- puts 0 On Mon, Nov 12, 2018 at 9:17 PM Geoff Canyon wrote: A few things: 1. It seems codepointOffset can

Re: How to find offsets in Unicode Text fast

2018-11-12 Thread Geoff Canyon via use-livecode
I didn't realize until now that offset() simply fails with some unicode strings: put offset("a","↘𠜎qeiuruioqeaaa↘𠜎qeiuar",13) -- puts 0 On Mon, Nov 12, 2018 at 9:17 PM Geoff Canyon wrote: > A few things: > > 1. It seems codepointOffset can only find a single character? So it > won't work for an

Re: How to find offsets in Unicode Text fast

2018-11-12 Thread Geoff Canyon via use-livecode
A few things: 1. It seems codepointOffset can only find a single character? So it won't work for any search for a multi-character string? 2: codepointOffset seems to work differently for multi-byte characters and regular characters: put codepointoffset("e","↘ndatestest",6) -- puts 3 put codepoint

Re: How to find offsets in Unicode Text fast

2018-11-12 Thread Monte Goulding via use-livecode
Hi Folks I was a bit perplexed by this so I had a quick look about the engine and I see the issue. The problem is you are using `offset` which works on characters. Characters in LiveCode are neither Unicode codepoints nor bytes. They are graphemes. This means that when you have chars to skip the
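To make the distinction between bytes, codepoints, and graphemes concrete, here is a small sketch in Python rather than LiveCode. The helper name `describe` is hypothetical. The Python standard library has no grapheme segmenter, so the NFC-normalized length is used only as a rough stand-in for "what a reader sees as one character"; this is an assumption for illustration and not how the LiveCode engine counts characters.

```python
import unicodedata


def describe(text: str) -> dict:
    """Count the same text in three different units (illustrative helper)."""
    return {
        # Number of Unicode code points in the string.
        "code_points": len(text),
        # Number of bytes once encoded as UTF-8.
        "utf8_bytes": len(text.encode("utf-8")),
        # Code points after NFC composition; a rough proxy for graphemes
        # that only works when a precomposed form exists.
        "nfc_code_points": len(unicodedata.normalize("NFC", text)),
    }


# "e" followed by U+0301 COMBINING ACUTE ACCENT: one visible character.
counts = describe("e\u0301")
print(counts)
```

A search that counts in one unit and reports positions in another has to convert between them, which is where the extra cost and the surprising results discussed in this thread can come from.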

Re: How to find offsets in Unicode Text fast

2018-11-12 Thread Geoff Canyon via use-livecode
On Mon, Nov 12, 2018 at 11:36 AM Ben Rubinstein via use-livecode < use-livecode@lists.runrev.com> wrote: > I'm really confused that case-insensitive should work at all for UTF-16 or > UTF-32; This is so puzzling. I tried this code in a button: on mouseUp put "Ѡ" into x put "ѡ" into y -

Re: How to find offsets in Unicode Text fast

2018-11-12 Thread Geoff Canyon via use-livecode
On Mon, Nov 12, 2018 at 11:36 AM Ben Rubinstein via use-livecode < use-livecode@lists.runrev.com> wrote: > > I'm really confused that case-insensitive should work at all for UTF-16 or > UTF-32; at this point as far as I understand it, LC has no idea that how > to > correctly interpret the value of

Re: How to find offsets in Unicode Text fast

2018-11-12 Thread Niggemann, Bernd via use-livecode
Ben, Please see my remarks about failing UTF-32 with some Icelandic characters. Currently I would not recommend offset(UTF-32 text) unless one knows which character set is suited to be used and is in control of that character set. The same goes for UTF-16. I also thought that byteOffset would be

Re: How to find offsets in Unicode Text fast

2018-11-12 Thread Ben Rubinstein via use-livecode
Coming late to this discussion. Very excited by this approach of converting everything to UTF-32 in order to do fast offsets. I'm really confused that case-insensitive should work at all for UTF-16 or UTF-32; at this point as far as I understand it, LC has no idea how to correctly interpr

Re: How to find offsets in Unicode Text fast

2018-11-12 Thread Brian Milby via use-livecode
On 12.11.2018 at 12:00, use-livecode-requ...@lists.runrev.com wrote: > > From: Brian Milby > To: How to use LiveCode > Subject: Re: How to find offsets in Unicode Text fast > > > I just tried one additional test. Search for "åå" within "aaååÅÅååaa". > (On a M

Re: How to find offsets in Unicode Text fast

2018-11-12 Thread Niggemann, Bernd via use-livecode
ext. Kind regards Bernd On 12.11.2018 at 12:00, use-livecode-requ...@lists.runrev.com wrote: From: Brian Milby To: How to use LiveCode <use-livecode@lists.runrev.com> Subject: Re: How to find offsets in Unicode Text fast I just

Re: How to find offsets in Unicode Text fast

2018-11-11 Thread Brian Milby via use-livecode
I just tried one additional test. Search for "åå" within "aaååÅÅååaa". (On a Mac keyboard, the characters are made with A, Option-A, and Shift-Option-A.) The Offset UTF16 version does not return the correct result if case sensitive is false (returns the same value as if it were true: 3,7). Every
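The test case above can be reproduced outside LiveCode as a reference for what a Unicode-aware search would be expected to return. This is a minimal Python sketch, not the code from the stack under discussion; `find_all` and its parameters are hypothetical names, and it assumes the text uses the precomposed characters U+00E5 and U+00C5.

```python
def find_all(needle: str, haystack: str,
             case_sensitive: bool = True, overlaps: bool = True) -> list[int]:
    """Return 1-based positions of needle in haystack (illustrative helper)."""
    if not needle:
        return []
    if not case_sensitive:
        # casefold() is Unicode-aware; note it can change string length for
        # some characters (e.g. "ß"), which would shift reported positions.
        needle, haystack = needle.casefold(), haystack.casefold()
    result, start = [], 0
    while True:
        found = haystack.find(needle, start)
        if found == -1:
            return result
        result.append(found + 1)
        # Step by one to allow overlapping matches, else skip the match.
        start = found + (1 if overlaps else len(needle))


text = "aa\u00e5\u00e5\u00c5\u00c5\u00e5\u00e5aa"   # "aaååÅÅååaa"
print(find_all("\u00e5\u00e5", text))                        # case sensitive
print(find_all("\u00e5\u00e5", text, case_sensitive=False))  # case insensitive
```

With case sensitivity on, this yields 3 and 7, the same values quoted above; with it off, additional matches appear inside the "ÅÅ" run.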

Re: How to find offsets in Unicode Text fast

2018-11-11 Thread Brian Milby via use-livecode
I just pushed an updated binary stack that adds check boxes for case sensitive and no overlaps. These settings are per card so separate tests can be performed each way. Of note, the search for "The" in John 1 is quite a bit faster if case sensitive is true. Also, if case sensitive is true, then

Re: How to find offsets in Unicode Text fast

2018-11-11 Thread Brian Milby via use-livecode
I just posted an updated stack with the UTF16 and UTF32 offset variants. I did change the search on the first card to “The” and the counts remained the same so case folding does work for ASCII values.  I would need some other test text to check other cases where Unicode case folding would be exp

Re: How to find offsets in Unicode Text fast

2018-11-10 Thread Geoff Canyon via use-livecode
One thing I don't get is how (not) caseSensitive gets handled? Once the text is all binary data, is the engine really still able to look at the binary values for "A" and "a" and treat them as the same? On Sat, Nov 10, 2018 at 8:54 PM Brian Milby via use-livecode < use-livecode@lists.runrev.com> wr

Re: How to find offsets in Unicode Text fast

2018-11-10 Thread Brian Milby via use-livecode
The correct formula for UTF16 should be: put tPos div 2 + 1,"" after tResult The correct formula for UTF32 should be: put tPos div 4 + 1,"" after tResult If you go to card #6 of my stack that is on GitHub, it has the first chapter of John that I copied from the internet. I added a single UTF(8?)
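The arithmetic behind these two formulas can be checked with a short Python sketch. It assumes, as in LiveCode, that the byte position is 1-based; `char_position` and `all_offsets_utf32` are hypothetical names, not the handlers from the stack. The alignment check is an added precaution that is not taken from the message above: a raw byte search can, in principle, match a pattern that straddles two 4-byte units.

```python
def char_position(byte_pos: int, bytes_per_unit: int) -> int:
    """Map a 1-based byte position to a 1-based code-unit position."""
    return byte_pos // bytes_per_unit + 1


def all_offsets_utf32(needle: str, haystack: str) -> list[int]:
    """Find needle in haystack by searching UTF-32 bytes (illustrative)."""
    n = needle.encode("utf-32-le")      # "-le" avoids a byte-order mark
    h = haystack.encode("utf-32-le")
    result, start = [], 0
    while True:
        found = h.find(n, start)        # 0-based byte index, -1 if absent
        if found == -1:
            return result
        if found % 4 == 0:              # keep only unit-aligned matches
            result.append(char_position(found + 1, 4))
        start = found + 1


print(all_offsets_utf32("aa", "xaaay"))
print(all_offsets_utf32("aa", "\u2198\U0002070Eaa"))   # "↘𠜎aa"
```

For UTF-16 the same helper applies with `bytes_per_unit` set to 2, but the result is then a position in 16-bit code units, which differs from the code point position whenever a character outside the Basic Multilingual Plane precedes the match.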

Re: How to find offsets in Unicode Text fast

2018-11-10 Thread Niggemann, Bernd via use-livecode
Hi Richmond Richmond via use-livecode Sat, 10 Nov 2018 11:42:50 -0800 >I don't know who told you

Re: How to find offsets in Unicode Text fast

2018-11-10 Thread Niggemann, Bernd via use-livecode
That is what I alluded to, UTF is a wild country and I don't know my ways, try - function allOffsets pDelim, pString, pCaseSensitive local tNewPos, tPos, tResult put textEncode(pDelim,"UTF32") into pDelim put textEncode(pString,"UTF32") into pString set th

Re: How to find offsets in Unicode Text fast

2018-11-10 Thread Geoff Canyon via use-livecode
Unfortunately, I just discovered that your solution doesn't produce correct results. If I get the offsets of "aa" in "↘𠜎aa↘𠜎a↘𠜎", My code (and Brian Milby's) will return: 7,8,9,10 Your code will return: 9,10,11,12 As I understand it, textEncode transforms unicode text int
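One possible source of a shift like this, assuming the faster code searched UTF-16 data, is that characters outside the Basic Multilingual Plane (such as 𠜎, U+2070E) occupy two 16-bit code units. The message is cut off before its own explanation, so the following Python sketch only illustrates the general effect; the function names are hypothetical.

```python
def utf16_units(text: str) -> int:
    """Number of 16-bit code units the text occupies in UTF-16."""
    return len(text.encode("utf-16-le")) // 2


def utf16_pos_to_codepoint_pos(text: str, unit_pos: int) -> int:
    """Convert a 1-based UTF-16 code-unit position to a 1-based
    code point position. Raises UnicodeDecodeError if unit_pos
    falls in the middle of a surrogate pair."""
    prefix = text.encode("utf-16-le")[: (unit_pos - 1) * 2]
    return len(prefix.decode("utf-16-le")) + 1


sample = "\u2198\U0002070Eaa"          # "↘𠜎aa"
print(len(sample[:2]), utf16_units(sample[:2]))
print(utf16_pos_to_codepoint_pos(sample, 4))
```

Here the two-character prefix takes three UTF-16 code units, so a match found at code unit 4 corresponds to code point 3. Each earlier character of this kind pushes the reported position one further to the right.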

Re: How to find offsets in Unicode Text fast

2018-11-10 Thread Niggemann, Bernd via use-livecode
I figured that the slowdown was due to UTF8, for each char it has to test if it is a compounded character. So I just tried with utf16 figuring, that now it just compares at the byte-level. As it turned out it was indeed faster. Now I don't understand unicode but as I understand for some langua

Re: How to find offsets in Unicode Text fast

2018-11-10 Thread Richmond via use-livecode
I don't know who told you that ð was an Icelandic d. The ð is called the "eth", and was used in Anglo-Saxon interchangeably with the thorn to represent the 2 sounds that are now represented in English by the digraph th. As such Icelandic has retained the eth sign. In Icelandic the /d/ sound

Re: How to find offsets in Unicode Text fast

2018-11-10 Thread Geoff Canyon via use-livecode
This is faster -- under some circumstances, much faster! Any idea why textEncoding suddenly fixes everything? On Sat, Nov 10, 2018 at 5:13 AM Niggemann, Bernd via use-livecode < use-livecode@lists.runrev.com> wrote: > This is a little late but there was a discussion about the slowness of > simple