I didn't realize this conversation was just between Bernd and me, so here it is for the list. Bernd found a solution for the Reykjavík issue (seemingly: it works, but it's weird), and based on a conversation in another thread I have a solution for case-insensitive matching. The UTF-32 version has been updated to account for both issues. It's available here: https://github.com/gcanyon/alloffsets
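
For anyone who wants the idea without opening the repository, here is a rough sketch of how the pieces fit together: fold case with toUpper when the search is not case sensitive, apply Bernd's "せ" trick around textEncode, then search the bytes with byteOffset and keep only hits that start on a 4-byte boundary. This is not the code in the repository. The handler name allOffsetsSketch is made up for illustration, the sketch assumes textEncode's "UTF-32" output carries no byte-order mark (as the quoted code also assumes), and it assumes toUpper does not change the length of the text. Unlike the quoted version, it only applies the no-overlap skip after an aligned hit.

```livecode
function allOffsetsSketch pFind, pString, pCaseSensitive, pNoOverlaps
   -- hypothetical handler name; a sketch, not the repository code
   local tNewPos, tPos, tResult, tSkip

   -- fold case up front; the byte search below is always exact
   if pCaseSensitive is not true then
      put toUpper(pFind) into pFind
      put toUpper(pString) into pString
   end if

   -- Bernd's workaround: append "せ" before encoding ...
   put textEncode(pFind & "せ", "UTF-32") into pFind
   put textEncode(pString & "せ", "UTF-32") into pString
   -- ... and strip its 4 bytes again afterwards
   delete byte -4 to -1 of pFind
   delete byte -4 to -1 of pString

   -- bytes to jump past after a hit when overlaps are not wanted
   put the number of bytes in pFind - 1 into tSkip

   put 0 into tPos
   repeat forever
      put byteOffset(pFind, pString, tPos) into tNewPos
      if tNewPos = 0 then exit repeat
      add tNewPos to tPos
      -- a real character match starts on a 4-byte boundary
      if tPos mod 4 = 1 then
         put (tPos div 4 + 1) & comma after tResult
         if pNoOverlaps is true then add tSkip to tPos
      end if
   end repeat

   if tResult is empty then return 0
   return char 1 to -2 of tResult
end allOffsetsSketch
```

Called as allOffsetsSketch("Reykjavík", "Reykjavík er höfuðborg", false, false), the intent is to get back 1; I have only checked that against Bernd's description, so treat it as a starting point and test it on your own data.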

On Tue, Nov 13, 2018 at 12:23 PM Geoff Canyon <gcan...@gmail.com> wrote:

> It's amazing to me that appending the character "せ" for the textEncoding
> fixes the issues with the other characters. I have no idea why that would
> affect anything else at all. Maybe the engine crew can weigh in.
>
> In any case, you seem to have hit on the right bizarre solution, so I
> added that in. I also added a modification to correctly handle case by
> using toUpper instead of (wrongly) depending on caseSensitive, and changed
> from offset to byteOffset, which might speed things up a little. The UTF32
> version is about 3x faster than the item-based solution, but both scale
> well, so I added comments leaving it up to the developer which to use.
>
> The updated version is here: https://github.com/gcanyon/alloffsets
>
> On Tue, Nov 13, 2018 at 9:34 AM Niggemann, Bernd <
> bernd.niggem...@uni-wh.de> wrote:
>
>> Geoff,
>>
>> The thread is very instructive but also a bit disillusioning as far as
>> speed goes. I tried a couple of things Mark Waddingham recommended and they
>> kind of work (I don't know if I did it correctly) but are still slow. Not
>> as slow as simple offset for complex texts but still.
>>
>> Here I pick up on your latest attempt to use UTF-32 which fails on
>> Icelandic Reykjavík (the í is the culprit). There are more Icelandic
>> characters that fail UTF32.
>>
>> On the other hand UTF-32 works surprisingly fast and in many cases
>> accurately.
>>
>> Now I figured that forcing the text to be UTF-32 compliant I would cheat
>> in appending a Japanese character to pFind and pSearch before converting to
>> UTF-32 and removing those afterwards.
>>
>> It turns out that it cures the Icelandic disease...
>> It should also cure similar cases in similar languages.
>> It turns out to be accurate in many things I tested in Brian's test stack.
>>
>> I would love to know the limits of this approach
>>
>> Kind regards
>>
>> Bernd
>>
>> here is the code, additions marked as "new"
>>
>> -------------------------------------------------
>> function allOffsets pFind,pString,pCaseSensitive,pNoOverlaps
>>    -- returns a comma-delimited list of the offsets of pFind in pString
>>    -- note, this seems to fail on some searches, for example:
>>    -- searching for "Reykjavík" in "Reykjavík er höfuðborg"
>>    -- It is uncertain why.
>>    -- See thread here:
>>    -- http://lists.runrev.com/pipermail/use-livecode/2018-November/251357.html
>>    local tNewPos, tPos, tResult, tSkip
>>
>>    put "せ" after pFind #<- new force UTF-32
>>    put "せ" after pString #<- new force UTF-32
>>
>>    put textEncode(pFind,"UTF-32") into pFind
>>    put textEncode(pString,"UTF-32") into pString
>>
>>    delete byte -4 to -1 of pFind #<- new force UTF-32
>>    delete byte -4 to -1 of pString #<- new force UTF-32
>>
>>    if pNoOverlaps then put length(pFind) - 1 into tSkip
>>
>>    set the caseSensitive to pCaseSensitive is true
>>    put 0 into tPos
>>    repeat forever
>>       put offset(pFind, pString, tPos) into tNewPos
>>       if tNewPos = 0 then exit repeat
>>       add tNewPos to tPos
>>       if tPos mod 4 = 1 then put (tPos div 4 + 1),"" after tResult
>>       if pNoOverlaps then add tSkip to tPos
>>    end repeat
>>    if tResult is empty then return 0
>>    else return char 1 to -2 of tResult
>> end allOffsets
>> -------------------------------------------------
>>