Re: First 1000 characters without loop?

Mark Waddingham via use-livecode Fri, 23 Jun 2017 02:12:07 -0700

On 2017-06-23 03:07, Peter W A Wood via use-livecode wrote:

Some Unicode characters, such as emojis, have to be represented by two
codepoints in UTF-16 (known as surrogates) so they take four bytes not
two. Additionally, characters with accents will take either one
codepoint or two depending on whether they have been coded in
pre-composed or decomposed form (e.g. ç can be either U+0063 U+0327
(decomposed) or U+00E7 (precomposed)).


So it isn’t easy to estimate the number of bytes in a UTF-16 string.

The number of bytes used by a string when encoded as UTF-16 is '2 * the number of codeunits in tString'.
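
As a rough cross-check outside LiveCode, the same arithmetic can be sketched in Python. This is only an illustration: Python does not store a codeunit count, so the hypothetical helper utf16_code_units obtains it by encoding the string as UTF-16 without a byte-order mark, and the sample strings are made up for the example.

```python
def utf16_code_units(text: str) -> int:
    """Count UTF-16 code units by encoding without a byte-order mark."""
    return len(text.encode("utf-16-le")) // 2


# Illustrative samples (not from the original thread).
samples = {
    "plain ASCII": "abc",
    "precomposed c-cedilla": "\u00e7",
    "decomposed c-cedilla": "c\u0327",
    "emoji (surrogate pair)": "\U0001F600",
}

for label, text in samples.items():
    units = utf16_code_units(text)
    # The UTF-16 byte count is always exactly twice the code unit count.
    print(f"{label}: {units} code units, {2 * units} bytes")
```

The emoji and the decomposed ç both come out at two codeunits (four bytes), for different reasons: the first is a surrogate pair, the second is two separate codepoints.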

The number of codeunits in a string in LiveCode is a stored property of the string, so doesn't require any computation. (We took the decision that regardless of how a string is stored internally, it should always be possible to ask for the number of codeunits in constant time, and to be able to look up a codeunit in constant time).

Note: codeunit is not the same as codepoint and codepoint is not the same as character. Both codepoint and character require scanning the string (in the general case) to both compute the i'th one, and to compute the length.
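
To make the three levels concrete, here is a small Python sketch (not LiveCode) that counts each one for the same string. The helper names are hypothetical, and approx_characters is only a rough stand-in for real character segmentation: it simply skips combining marks, whereas a full implementation would follow the Unicode text segmentation rules.

```python
import unicodedata


def code_units(text: str) -> int:
    """UTF-16 code units: two bytes each in the encoded form."""
    return len(text.encode("utf-16-le")) // 2


def code_points(text: str) -> int:
    """Python strings are sequences of code points, so len() counts them."""
    return len(text)


def approx_characters(text: str) -> int:
    """Rough approximation only: count code points that are not combining marks."""
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)


# Decomposed c-cedilla followed by an emoji outside the BMP.
text = "c\u0327\U0001F600"
print(code_units(text), code_points(text), approx_characters(text))
```

For this string the three counts all differ, which is why the i'th codepoint or character cannot in general be found without scanning.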

In contrast (to UTF-16), if you want the number of bytes a string takes up in UTF-8 encoding then you also have to scan the string, as a codepoint in UTF-8 can be 1-4 bytes in length.
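
The scan can be sketched as follows, again in Python purely for illustration. The function name utf8_length is hypothetical, and the thresholds are the standard UTF-8 ranges; it assumes the string contains no unpaired surrogates.

```python
def utf8_length(text: str) -> int:
    """Compute the UTF-8 byte length by visiting every code point."""
    total = 0
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:        # U+0000..U+007F: 1 byte
            total += 1
        elif cp < 0x800:     # U+0080..U+07FF: 2 bytes
            total += 2
        elif cp < 0x10000:   # U+0800..U+FFFF: 3 bytes
            total += 3
        else:                # U+10000..U+10FFFF: 4 bytes
            total += 4
    return total


# One code point from each range: 'a', c-cedilla, euro sign, emoji.
sample = "a\u00e7\u20ac\U0001F600"
print(utf8_length(sample), len(sample.encode("utf-8")))
```

The loop has to touch every codepoint, so the cost grows with the length of the string rather than being a constant-time lookup.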

I would guess that LiveCode will store the characters of a string in
single bytes if all the letters of the string conform to ISO-8859-1.
So if you can be certain that your text is all ISO-8859-1 encoded, you
can estimate at 1 byte per character. (The guess is based on the fact
that the first 256 Unicode code points replicate ISO-8859-1).

Almost true - the engine stores strings which can fit into the running platform's 'legacy' (in terms of pre 7.0) encoding (ISO8859-1, Latin-1, MacRoman) in that encoding in memory. This means that stacks written pre-unicode will use the same amount of memory and the same amount of processing time as they did before.

The reason this works is because all three of those encodings have the property that when they are converted to Unicode, the number of codeunits in the Unicode version is the same as the number of codes (indeed, bytes in this case) in the original string.
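
That property can be checked independently of the engine. The Python sketch below decodes every possible byte value with Python's own "latin-1" and "mac_roman" codecs and confirms that each one becomes exactly one UTF-16 codeunit. It says nothing about how LiveCode implements the conversion; the helper names are hypothetical.

```python
def utf16_code_units(text: str) -> int:
    """Count UTF-16 code units by encoding without a byte-order mark."""
    return len(text.encode("utf-16-le")) // 2


def preserves_length(codec: str) -> bool:
    """True if every single byte decodes to exactly one UTF-16 code unit."""
    for value in range(256):
        text = bytes([value]).decode(codec)
        if utf16_code_units(text) != 1:
            return False
    return True


for codec in ("latin-1", "mac_roman"):
    print(codec, preserves_length(codec))
```

Because the mapping is one byte to one codeunit, the stored codeunit count and constant-time codeunit lookup hold without any conversion of the single-byte storage.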


Warmest Regards,

Mark.

--
Mark Waddingham ~ m...@livecode.com ~ http://www.livecode.com/
LiveCode: Everyone can create apps

