On 2017-06-23 03:07, Peter W A Wood via use-livecode wrote:
> Some Unicode characters, such as emojis, have to be represented by two
> codepoints in UTF-16 (known as surrogates) so they take four bytes not
> two. Additionally, characters with accents will take either one
> codepoint or two depending on whether they have been coded in
> pre-composed or decomposed form (e.g. ç can be either U+0063 U+0327
> (decomposed) or U+00E7 (precomposed)).
> So it isn't easy to estimate the number of bytes in a UTF-16 string.
The number of bytes used by a string when encoded as UTF-16 is '2 * the
number of codeunits in tString'.
The number of codeunits in a string in LiveCode is a stored property of
the string, so doesn't require any computation. (We took the decision
that regardless of how a string is stored internally, it should always
be possible to ask for the number of codeunits in constant time, and to
be able to look up a codeunit in constant time).
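As an illustration outside LiveCode, here is a minimal Python sketch of the '2 * codeunits' rule. The helper names are made up for this example, and Python has to encode the string to count codeunits, whereas the engine keeps the count as a stored property.

```python
def utf16_code_units(text: str) -> int:
    """Number of 16-bit codeunits needed to hold text as UTF-16."""
    # "utf-16-le" writes no byte order mark, so every 2 bytes is one codeunit.
    return len(text.encode("utf-16-le")) // 2


def utf16_byte_length(text: str) -> int:
    """Bytes used by text when encoded as UTF-16 (without a BOM)."""
    return 2 * utf16_code_units(text)


samples = {
    "plain": "abc",
    "precomposed": "\u00e7",   # ç as a single codepoint
    "decomposed": "c\u0327",   # c followed by a combining cedilla
    "emoji": "\U0001F600",     # outside the BMP, needs a surrogate pair
}
for name, sample in samples.items():
    print(name, utf16_code_units(sample), utf16_byte_length(sample))
```

The precomposed ç takes one codeunit (2 bytes); the decomposed form and the emoji each take two codeunits (4 bytes).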
Note: codeunit is not the same as codepoint and codepoint is not the
same as character. Both codepoint and character require scanning the
string (in the general case) to both compute the i'th one, and to
compute the length.
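To show why codepoints need a scan, here is a hedged Python sketch that counts codepoints in a list of UTF-16 codeunits by walking it and pairing up surrogates. The function names are hypothetical and this is not how the engine is implemented.

```python
def to_code_units(text: str) -> list:
    """Split text into its UTF-16 codeunits (as integers)."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


def count_codepoints(units: list) -> int:
    """Count codepoints by scanning; a surrogate pair counts once."""
    count = 0
    i = 0
    while i < len(units):
        is_high = 0xD800 <= units[i] <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        # A high surrogate followed by a low surrogate is one codepoint.
        i += 2 if (is_high and has_low) else 1
        count += 1
    return count


units = to_code_units("a\U0001F600c\u0327")
print(len(units), count_codepoints(units))
```

This string has 5 codeunits and 4 codepoints, yet a reader sees only 3 characters ("a", the emoji, and "ç"). Counting those characters would need a further scan using grapheme segmentation rules, which this sketch does not attempt.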
In contrast (to UTF-16), if you want the number of bytes a string takes
up in UTF-8 encoding then you also have to scan the string, as a
codepoint in UTF-8 can be 1-4 bytes in length.
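The UTF-8 case can be sketched in Python as follows. The function name is made up for illustration; it sums the standard 1-4 byte cost of each codepoint, which is why every codepoint has to be visited.

```python
def utf8_byte_length(text: str) -> int:
    """Bytes needed to encode text as UTF-8, found by scanning codepoints."""
    total = 0
    for ch in text:          # Python iterates a str one codepoint at a time
        cp = ord(ch)
        if cp < 0x80:        # ASCII
            total += 1
        elif cp < 0x800:     # e.g. Latin letters with accents
            total += 2
        elif cp < 0x10000:   # rest of the Basic Multilingual Plane
            total += 3
        else:                # supplementary planes, e.g. most emoji
            total += 4
    return total


sample = "a\u00e7\u20ac\U0001F600"   # a, ç, €, emoji
print(utf8_byte_length(sample), len(sample.encode("utf-8")))
```

Both numbers printed are 10 (1 + 2 + 3 + 4), so the hand count agrees with Python's own encoder for this sample.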
> I would guess that LiveCode will store the characters of a string in
> single bytes if all the letters of the string conform to ISO-8859-1.
> So if you can be certain that your text is all ISO-8859-1 encoded, you
> can estimate at 1 byte per character. (The guess is based on the fact
> that the first 256 Unicode code points replicate ISO-8859-1).
Almost true - the engine stores strings which can fit into the
running platform's 'legacy' (in terms of pre 7.0) encoding (ISO8859-1,
Latin-1, MacRoman) in that encoding in memory. This means that stacks
written pre-Unicode will use the same amount of memory, and the same
amount of processing time, as they did before.
The reason this works is because all three of those encodings have the
property that when they are converted to Unicode, the number of
codeunits in the Unicode version is the same as the number of codes
(indeed, bytes in this case) in the original string.
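That property can be checked from outside the engine with a small Python sketch. It assumes Python's "latin-1" and "mac_roman" codecs are fair stand-ins for the legacy encodings mentioned above, and the helper name is hypothetical.

```python
def utf16_code_units(text: str) -> int:
    """Number of 16-bit codeunits needed to hold text as UTF-16."""
    return len(text.encode("utf-16-le")) // 2


# Every possible Latin-1 byte maps to exactly one UTF-16 codeunit.
all_bytes = bytes(range(256))
latin1_units = utf16_code_units(all_bytes.decode("latin-1"))
print(len(all_bytes), latin1_units)

# A MacRoman sample: "café" is 4 bytes in MacRoman and 4 codeunits in UTF-16.
legacy = "caf\u00e9".encode("mac_roman")
mac_units = utf16_code_units(legacy.decode("mac_roman"))
print(len(legacy), mac_units)
```

Because the byte count and the codeunit count match, an index into the legacy bytes is also a valid codeunit index, with no translation table needed.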
Warmest Regards,
Mark.
--
Mark Waddingham ~ m...@livecode.com ~ http://www.livecode.com/
LiveCode: Everyone can create apps