On Mon, Nov 12, 2018 at 11:36 AM Ben Rubinstein via use-livecode <use-livecode@lists.runrev.com> wrote:
> I'm really confused that case-insensitive should work at all for UTF-16 or
> UTF-32; at this point as far as I understand it, LC has no idea how to
> correctly interpret the value of the variable as text.
>
> Mr Very Picky would also suggest that to be really correct, the code in this
> case should also check that the offset found was on a four-byte boundary
> (tPos mod 4 = 1); it's probably a purely theoretical consideration, but I
> think that the four-byte sequence (representing the character you're
> searching for) could in theory be incorrectly matched across two other
> characters.

I also thought of the four-byte boundary consideration. The code below is available at: https://github.com/gcanyon/alloffsets

For example, previous UTF-32 versions will fail on characters like 𐀁, which converts to 00010001, and therefore finding 𐀁 in 𐀁𐀁𐀁 would return 1,1,2,2,3. I don't know how many other characters are affected; given the characters currently assigned in Unicode there are a few, but likely not many. The failure searching for "Reykjavík" in "Reykjavík er höfuðborg" is weirder and worse, obviously.

I was puzzled at first that case-insensitive matching works on UTF-32-encoded strings, but I realized that standard case-insensitive searches are presumably just implemented as a set of exceptions at a low level. For example, the engine isn't looking at "a" and "A" and saying, "those are the same." Instead, it's looking at raw ASCII and mapping 97 to 65 if case-insensitive is requested. The same must be true of UTF-32: the engine isn't looking at "Ѡ" and "ѡ", it's mapping 00000460 to 00000461. I agree that it seems a little odd that LC knows the string of binary data is "text", but maybe there's some trick to that?
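To make the 1,1,2,2,3 case concrete, here is a minimal sketch of the byte-level effect. It is written in Python rather than LiveCode, and it is not how the engine implements offset; it only shows what a raw byte search does to UTF-32 data. The function name utf32_offsets and its respect_boundaries flag are made up for this illustration, and it assumes big-endian UTF-32 with no byte-order mark, so that 𐀁 (U+10001) encodes as 00 01 00 01.

```python
def utf32_offsets(find, text, respect_boundaries):
    """Return 1-based character offsets of find in text, using a raw byte search."""
    # Assumption: big-endian UTF-32 without a BOM, so U+10001 -> 00 01 00 01.
    needle = find.encode("utf-32-be")
    haystack = text.encode("utf-32-be")
    hits = []
    pos = haystack.find(needle)
    while pos != -1:
        # pos is 0-based here, so an aligned match has pos % 4 == 0
        # (the 1-based equivalent is tPos mod 4 = 1).
        if not respect_boundaries or pos % 4 == 0:
            hits.append(pos // 4 + 1)
        # Resume one byte later so overlapping matches are still found.
        pos = haystack.find(needle, pos + 1)
    return hits


three = "\U00010001" * 3
print(utf32_offsets("\U00010001", three, respect_boundaries=False))  # [1, 1, 2, 2, 3]
print(utf32_offsets("\U00010001", three, respect_boundaries=True))   # [1, 2, 3]
```

Without the boundary check, the byte search also matches at byte offsets 2 and 6, which straddle two characters, and those get rounded down to character positions 1 and 2. With the check, only the matches that start on a character boundary survive.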
Anyway, here's the boundary-respecting, but still-flawed version of UTF-32-based allOffsets, with a documented bad example in a comment:

function allOffsetsUTF32 pFind,pString,pCaseSensitive,pNoOverlaps
   -- returns a comma-delimited list of the offsets of pFind in pString
   -- note, this seems to fail on some searches, for example:
   -- searching for "Reykjavík" in "Reykjavík er höfuðborg"
   -- It is uncertain why.
   -- See thread here: http://lists.runrev.com/pipermail/use-livecode/2018-November/251357.html
   local tNewPos, tPos, tResult, tSkip
   put textEncode(pFind,"UTF-32") into pFind
   put textEncode(pString,"UTF-32") into pString
   if pNoOverlaps then put length(pFind) - 1 into tSkip
   set the caseSensitive to pCaseSensitive is true
   put 0 into tPos
   repeat forever
      put offset(pFind, pString, tPos) into tNewPos
      if tNewPos = 0 then exit repeat
      add tNewPos to tPos
      if tPos mod 4 = 1 then put (tPos div 4 + 1),"" after tResult
      if pNoOverlaps then add tSkip to tPos
   end repeat
   if tResult is empty then return 0
   else return char 1 to -2 of tResult
end allOffsetsUTF32