On 2017-06-23 03:07, Peter W A Wood via use-livecode wrote:
Some Unicode characters, such as emojis, have to be represented by two
codepoints in UTF-16 (known as surrogates) so they take four bytes not
two. Additionally, characters with accents will take either one
codepoint or two depending on whether they have been coded in
precomposed or decomposed form (e.g. ç can be either U+0063 U+0327
(decomposed) or U+00E7 (precomposed)).

So it isn’t easy to estimate the number of bytes in a UTF-16 string.

The number of bytes used by a string when encoded as UTF-16 is '2 * the number of codeunits in tString'.
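To illustrate the rule outside LiveCode, here is a minimal Python sketch. The function names are hypothetical and this is not how the engine does it (the engine keeps the codeunit count as a stored property); it only shows that the UTF-16 byte length is twice the codeunit count, including for surrogate pairs and decomposed accents.

```python
def utf16_code_units(s: str) -> int:
    """Count UTF-16 code units: 1 per BMP code point, 2 (a surrogate pair) otherwise."""
    return sum(2 if ord(ch) > 0xFFFF else 1 for ch in s)


def utf16_byte_length(s: str) -> int:
    """Bytes needed for s in UTF-16, without a byte-order mark."""
    return 2 * utf16_code_units(s)


precomposed = "\u00e7"   # ç as a single code point
decomposed = "c\u0327"   # c followed by a combining cedilla
emoji = "\U0001F600"     # outside the BMP, so it needs a surrogate pair

for text in (precomposed, decomposed, emoji):
    # Cross-check against Python's own UTF-16 encoder.
    assert utf16_byte_length(text) == len(text.encode("utf-16-le"))
    print(repr(text), utf16_byte_length(text))
```

The precomposed ç needs 2 bytes, while the decomposed form and the emoji each need 4.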

The number of codeunits in a string in LiveCode is a stored property of the string, so doesn't require any computation. (We took the decision that regardless of how a string is stored internally, it should always be possible to ask for the number of codeunits in constant time, and to be able to look up a codeunit in constant time).

Note: codeunit is not the same as codepoint and codepoint is not the same as character. Both codepoint and character require scanning the string (in the general case) to both compute the i'th one, and to compute the length.
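The three levels can be compared side by side in a short Python sketch. The helper names are hypothetical, and the character count is only a rough approximation that skips combining marks; real character (grapheme) segmentation follows more detailed Unicode rules than this.

```python
import unicodedata


def count_code_units(s: str) -> int:
    """UTF-16 code units: every 2 bytes of the UTF-16 encoding is one unit."""
    return len(s.encode("utf-16-le")) // 2


def count_code_points(s: str) -> int:
    """Python strings are sequences of code points, so len() counts them."""
    return len(s)


def count_characters_approx(s: str) -> int:
    """Rough approximation: count code points that are not combining marks."""
    return sum(1 for ch in s if unicodedata.combining(ch) == 0)


for text in ("\U0001F600", "c\u0327"):
    print(
        repr(text),
        count_code_units(text),
        count_code_points(text),
        count_characters_approx(text),
    )
```

The emoji is two codeunits but one codepoint, and the decomposed ç is two codepoints but one character, so no two of the counts can be assumed equal.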

In contrast (to UTF-16), if you want the number of bytes a string takes up in UTF-8 encoding then you also have to scan the string, as a codepoint in UTF-8 can be 1-4 bytes in length.
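As a sketch of why the scan is needed, the following Python function (a hypothetical name, not a LiveCode API) walks every codepoint and adds up its UTF-8 width, using the standard UTF-8 range boundaries.

```python
def utf8_byte_length(s: str) -> int:
    """Sum the UTF-8 width of each code point; every code point must be visited."""
    total = 0
    for ch in s:
        cp = ord(ch)
        if cp < 0x80:
            total += 1   # ASCII
        elif cp < 0x800:
            total += 2   # e.g. most accented Latin letters
        elif cp < 0x10000:
            total += 3   # rest of the BMP
        else:
            total += 4   # supplementary planes, e.g. emoji
    return total


for text in ("a", "\u00e7", "\u20ac", "\U0001F600"):
    # Cross-check against Python's own UTF-8 encoder.
    assert utf8_byte_length(text) == len(text.encode("utf-8"))
    print(repr(text), utf8_byte_length(text))
```

Unlike the UTF-16 case, there is no fixed multiplier here: the result depends on which codepoints the string contains.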

I would guess that LiveCode will store the characters of a string in
single bytes if all the letters of the string conform to ISO-8859-1.
So if you can be certain that your text is all ISO-8859-1 encoded, you
can estimate at 1 byte per character. (The guess is based on the fact
that the first 256 Unicode code points replicate ISO-8859-1).

Almost true - the engine stores strings which can fit into the running platform's 'legacy' (in terms of pre 7.0) encoding (ISO8859-1, Latin-1, MacRoman) in that encoding in memory. This means that stacks written pre-Unicode will use the same amount of memory, and the same amount of processing time, as they did before.

The reason this works is because all three of those encodings have the property that when they are converted to Unicode, the number of codeunits in the Unicode version is the same as the number of codes (indeed, bytes in this case) in the original string.
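This property can be checked with a small Python sketch. It assumes Python's "latin-1" and "mac_roman" codecs as stand-ins for the engine's encodings, and the function name is hypothetical; it decodes every possible byte value and confirms that the Unicode result has one codeunit per original byte.

```python
def code_units_after_decoding(raw: bytes, encoding: str) -> int:
    """Decode legacy bytes to Unicode and count the resulting UTF-16 code units."""
    text = raw.decode(encoding)
    return len(text.encode("utf-16-le")) // 2


all_bytes = bytes(range(256))  # every possible single-byte value

for encoding in ("latin-1", "mac_roman"):
    # One byte in, one code unit out: the lengths stay equal.
    assert code_units_after_decoding(all_bytes, encoding) == len(all_bytes)
    print(encoding, "preserves length")
```

Because the lengths match, an index into the legacy bytes is also a valid codeunit index into the Unicode form of the same string.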

Warmest Regards,

Mark.

--
Mark Waddingham ~ m...@livecode.com ~ http://www.livecode.com/
LiveCode: Everyone can create apps
