On 2018-11-13 07:15, Geoff Canyon via use-livecode wrote:
On Mon, Nov 12, 2018 at 3:50 PM Monte Goulding via use-livecode <use-livecode@lists.runrev.com> wrote:
Unless I'm misunderstanding, this hasn't been my observation. Using offset
on a string that has been textEncode()d to UTF-32 returns values that are
4 * (the character offset - 1) + 1. If it were re-encoded, wouldn't it
return the actual offsets (except when it fails)? Also, 𐀁 encodes to
00010001, and routines that convert to UTF-32 and then use offset will find
five instances of that character in the UTF-32 encoding because of improper
boundaries. To see this, run this code:
on mouseUp
   put textEncode("𐀁","UTF-32") into X
   put textEncode("𐀁𐀁𐀁","UTF-32") into Y
   put offset(X,Y,1)
end mouseUp
That will return 2, meaning that it found the encoding for X starting at
character 2 + 1 = 3 of Y. In other words, it found X using the last half of
the first "𐀁" and the first half of the second "𐀁".
The textEncode function generates binary data, which is composed of
bytes. When you use binary data in a text function (which offset is),
the engine applies a compatibility conversion that treats the sequence of
bytes as a sequence of native characters. This preserves what happened
pre-7.0, when strings were only ever native and binary data and strings
were therefore essentially the same thing.
So if you textEncode a 1 (native) character string as UTF-32, you will
get a four byte string, which will then turn back into a 4 (native)
character string when passed to offset.
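
One way to avoid the misaligned matches described above is to stay in binary and accept a hit only when it starts on a 4-byte boundary. The following is a minimal sketch under that assumption, not an engine feature: the handler name utf32AlignedOffset and its parameters are made up for illustration. It uses byteOffset rather than offset, so the native-character conversion is not involved, and it returns a code point index, which can differ from a character index when a character is made of several code points.

```livecode
-- Hypothetical helper (not an engine function).
-- Returns the 1-based code point index of pNeedle in pHaystack,
-- or 0 if there is no match starting on a 4-byte boundary.
function utf32AlignedOffset pNeedle, pHaystack
   local tNeedle, tHaystack, tSkip, tFound
   put textEncode(pNeedle, "UTF-32") into tNeedle
   put textEncode(pHaystack, "UTF-32") into tHaystack
   put 0 into tSkip
   repeat forever
      -- byteOffset counts from the first byte after the skipped ones
      put byteOffset(tNeedle, tHaystack, tSkip) into tFound
      if tFound is 0 then return 0
      -- convert to an absolute 1-based byte position
      add tSkip to tFound
      -- aligned hits start at bytes 1, 5, 9, ...
      if (tFound - 1) mod 4 is 0 then return (tFound - 1) div 4 + 1
      -- misaligned hit: resume the search from the next byte
      put tFound into tSkip
   end repeat
end utf32AlignedOffset
```

With the strings from the example above, utf32AlignedOffset("𐀁", "𐀁𐀁𐀁") should give 1, the first aligned hit, and a match that only exists across a code point boundary should give 0.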
Warmest Regards,
Mark.
--
Mark Waddingham ~ m...@livecode.com ~ http://www.livecode.com/
LiveCode: Everyone can create apps