On 2018-11-13 07:15, Geoff Canyon via use-livecode wrote:
On Mon, Nov 12, 2018 at 3:50 PM Monte Goulding via use-livecode <
use-livecode@lists.runrev.com> wrote:
Unless I'm misunderstanding, this hasn't been my observation. Using offset on a string that has been textEncode()ed to UTF-32 returns values that are 4 * (the character offset - 1) + 1 -- if it were re-encoded, wouldn't it return the actual offsets (except when it fails)? Also, 𐀁 encodes to 00010001, and routines that convert to UTF-32 and then use offset will find five instances of that character in the UTF-32 encoding of 𐀁𐀁𐀁 because of improper boundaries. To see this, run this code:

on mouseUp
   put textencode("𐀁","UTF-32") into X
   put textencode("𐀁𐀁𐀁","UTF-32") into Y
   put offset(X,Y,1)
end mouseUp

That will return 2, meaning that it found the encoding for X starting at character 2 + 1 = 3 of Y. In other words, it found X using the last half of the first "𐀁" and the first half of the second "𐀁".
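
For readers without LiveCode to hand, the same misaligned match can be reproduced with a small Python model. This is only a sketch: the helper names (model_offset, all_matches) are made up for illustration, the skip handling is modeled on the result reported above, and little-endian byte order is an assumption (for this particular character, big-endian gives the same positions).

```python
# Model of the behaviour described above, NOT LiveCode itself.
# Assumption: each byte of the UTF-32 data is treated as one character.

def model_offset(needle: bytes, haystack: bytes, skip: int = 0) -> int:
    """Hypothetical stand-in for offset(needle, haystack, skip).

    Returns the 1-based position of the first match, counted from the
    character after the skipped ones, or 0 if there is no match.
    """
    pos = haystack.find(needle, skip)
    return 0 if pos < 0 else pos - skip + 1

def all_matches(needle: bytes, haystack: bytes) -> list:
    """1-based start positions of every (possibly overlapping) match."""
    n = len(needle)
    return [i + 1 for i in range(len(haystack) - n + 1)
            if haystack[i:i + n] == needle]

# U+10001 is the character used in the example; byte order is assumed.
X = "\U00010001".encode("utf-32-le")
Y = ("\U00010001" * 3).encode("utf-32-le")

first_after_skip = model_offset(X, Y, 1)
matches = all_matches(X, Y)
# Only matches starting on a 4-byte boundary are real characters.
aligned = [m for m in matches if (m - 1) % 4 == 0]

print(first_after_skip)  # 2
print(matches)           # [1, 3, 5, 7, 9]
print(aligned)           # [1, 5, 9]
```

The five matches are the "five instances" mentioned above; the two at positions 3 and 7 straddle a character boundary.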

The textEncode function generates binary data which is composed of bytes. When you use binary data in a text function (which offset is), the engine uses a compatibility conversion which treats the sequence of bytes as a sequence of native characters (this preserves what happened pre-7.0 when strings were only ever native, and as such binary and string were essentially the same thing).

So if you textEncode a 1 (native) character string as UTF-32, you will get a four-byte string, which will then turn back into a 4 (native) character string when passed to offset.
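
The arithmetic can be checked outside LiveCode with a rough Python model. Treat this as a sketch under stated assumptions rather than a description of the engine: the helper names are hypothetical, the byte order is assumed to be little-endian, and "one byte becomes one native character" is modeled simply by searching the raw bytes.

```python
# Model of the byte-per-character behaviour, NOT LiveCode itself.

def encode_utf32(text: str) -> bytes:
    """Hypothetical stand-in for textEncode(text, "UTF-32").
    Little-endian, no byte order mark (an assumption)."""
    return text.encode("utf-32-le")

def byte_offset(needle: bytes, haystack: bytes) -> int:
    """1-based position of the first match, 0 if not found."""
    pos = haystack.find(needle)
    return 0 if pos < 0 else pos + 1

def to_char_offset(offset_in_bytes: int) -> int:
    """Invert 4 * (n - 1) + 1 to recover the character offset n."""
    return (offset_in_bytes - 1) // 4 + 1

one_char = encode_utf32("a")
text = "hello world"
raw = byte_offset(encode_utf32("w"), encode_utf32(text))

print(len(one_char))        # 4: one character becomes four bytes
print(raw)                  # 25, i.e. 4 * (7 - 1) + 1
print(to_char_offset(raw))  # 7: "w" is the 7th character
```

The inversion is only meaningful when the match starts on a 4-byte boundary, that is, when (offset - 1) is divisible by 4.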

Warmest Regards,

Mark.

--
Mark Waddingham ~ m...@livecode.com ~ http://www.livecode.com/
LiveCode: Everyone can create apps

_______________________________________________
use-livecode mailing list
use-livecode@lists.runrev.com
Please visit this url to subscribe, unsubscribe and manage your subscription 
preferences:
http://lists.runrev.com/mailman/listinfo/use-livecode
