On 2017-06-22 23:18, Richard Gaskin via use-livecode wrote:
> With many chunk expressions, I would imagine it does. With line
> chunks, for example, the engine needs to walk through the string,
> comparing each character to CR, counting the found CRs as it goes.
Yes - essentially that is the case (although technically it looks for
LF, not CR: currently - for better or for worse - the engine assumes
that LF is the line separator, and normalizes line endings
appropriately on a per-platform basis when you 'import' things as text
into LiveCode).
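As a rough illustration of that scan (in Python rather than LiveCode
script, and certainly not the engine's actual code), fetching line N
means walking the string and counting LF separators. The name
nth_line is made up for this sketch, and returning empty for a
missing line is an assumption of the sketch:

```python
def nth_line(text: str, n: int) -> str:
    """Return line n (1-based) of text by scanning for LF separators."""
    if n < 1:
        raise ValueError("line numbers start at 1")
    start = 0
    for _ in range(n - 1):
        # Each earlier line costs one scan forward to the next LF.
        pos = text.find("\n", start)
        if pos == -1:
            return ""  # fewer than n lines (sketch's choice: return empty)
        start = pos + 1
    end = text.find("\n", start)
    return text[start:] if end == -1 else text[start:end]
```

The work grows with the position of the line in the string, which is
why repeated 'line N of' lookups in a loop can add up.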
> In this case, though, I believe it doesn't need a loop per se, since
> AFAIK characters are fixed-size entities internally (Mark Waddingham,
> is that true that UTF-16 gives us two-bytes per char across the
> board?).
No, this is not quite true - characters are not fixed-size entities from
the computer's point of view. In LiveCode 'character' means 'grapheme' -
which is roughly what humans consider to be characters in terms of
writing and editing.
Indeed, there are several concepts here:
1) character: a character is a sequence of Unicode codepoints
2) codepoint: a codepoint is the index into the Unicode code table
(which has space for 1 million or so definitions)
3) codeunit: a codeunit is an index into the Basic Multilingual Plane
(BMP) - the first 65536 Unicode codes. The BMP contains a block of codes
called 'surrogates' which aren't actually codes in themselves, but allow
two codeunits to be used to express a codepoint for any code defined
above 65535.
Some examples:
Character 'a':
This is (as you might expect) always a single codepoint, and, indeed,
always a single codeunit (in Unicode 'a' is encoded with the same code
as it is in ASCII).
Character 'a-acute':
This can be represented either as a single codepoint (and codeunit)
'a-acute' (the same code as a-acute has in the ISO-8859-1 encoding, a
strict superset of ASCII), or as two codepoints 'a', 'combining-acute'.
In both cases, these codepoints are in the BMP, so each codepoint is
represented as a single codeunit.
Character 'smiling face with open mouth emoji':
This has code 0x1F603 - meaning it falls outside of the BMP (it is >
65535). It is a single codepoint, but requires two codeunits to encode.
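The three examples can be checked with a small Python sketch (Python
is used purely for illustration here; a Python string is a sequence of
codepoints, and utf16_units is a helper name invented for this sketch):

```python
def utf16_units(s: str) -> list[int]:
    """Return the UTF-16 code units of s as a list of 16-bit integers."""
    data = s.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


samples = {
    "a": "a",
    "a-acute (single codepoint)": "\u00e1",
    "a-acute (a + combining-acute)": "a\u0301",
    "smiling face with open mouth": "\U0001F603",
}

for name, s in samples.items():
    codepoints = [hex(ord(c)) for c in s]
    codeunits = [hex(u) for u in utf16_units(s)]
    print(f"{name}: codepoints={codepoints} codeunits={codeunits}")
```

The last sample is one codepoint (0x1f603) but two codeunits, the
surrogate pair 0xd83d 0xde03; the other samples have exactly one
codeunit per codepoint.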
Some comparisons:
ASCII, ISO8859-1 (Latin-1) and MacRoman are all 'single-codepoint'
encodings - all characters which those encodings can express are encoded
as a single codepoint.
Unicode is a 'multi-code' encoding - characters may require any number
of codepoints to express. For example:
- In Indic languages (which have a somewhat different structure than
languages like English, French, German etc.), many codepoints are often
needed to represent what humans might consider a 'character'.
- You can stack any number of defined 'combining accents' onto a base
character. You can have a character such as
a-acute-underbar-ring-grave-cedilla-umlaut if you want.
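- Emoji codepoints can be followed by 'modifier' codepoints which
allow customization of things like skin tone ('variation selectors',
which also follow the base codepoint, choose between text-style and
emoji-style presentation).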
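To make the stacking of combining accents concrete, here is a small
Python sketch using the standard unicodedata module. The particular
marks chosen are just one plausible reading of the stacked example
above, and the variable names are invented for illustration:

```python
import unicodedata

base = "a"
# acute, low line, ring above, grave, cedilla, diaeresis
marks = "\u0301\u0332\u030a\u0300\u0327\u0308"
stacked = base + marks

# Every codepoint after the base is a combining mark
# (non-zero canonical combining class).
all_combining = all(unicodedata.combining(c) != 0 for c in marks)

# One user-perceived character, but several codepoints.
print(len(stacked), all_combining)

# The two spellings of a-acute are canonically equivalent:
composed = unicodedata.normalize("NFC", "a\u0301")
decomposed = unicodedata.normalize("NFD", "\u00e1")
print(composed == "\u00e1", decomposed == "a\u0301")
```

Here stacked holds 7 codepoints even though a reader would see one
character, which is the gap the 'character' chunk has to bridge.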
Basically, Unicode is a model for encoding writing systems with the aim
that (over time) it can be used to represent *any* writing system which
exists now or existed in the past. In order to do this in a tractable
way (i.e. a way which could be implemented maintainably on modern
systems) it uses an abstract model (sequences of codepoints which form
characters). Due to this it can sometimes seem a little 'odd' but then
it is trying to model things which were not designed to necessarily fit
into a computer's viewpoint of the world - writing systems have evolved
organically without thought on how a computer might need to process
them.
In terms of LiveCode, you have access to 'character', 'codepoint'
and 'codeunit' chunks. In general:
- character access for general strings is never constant time, as
characters can require multiple codepoints.
- codepoint access for general strings is never constant time, as
codepoints can require two codeunits to encode.
- codeunit access for general strings is always constant time.
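The difference between the last two cases can be sketched in Python
over a list of UTF-16 codeunits. This is only a model of the idea, not
how the engine is written; codepoint_at and codeunit_at are names made
up for the sketch:

```python
def codepoint_at(units: list[int], n: int) -> int:
    """Return codepoint n (1-based) by scanning UTF-16 code units."""
    i = 0
    count = 0
    while i < len(units):
        u = units[i]
        # A high surrogate followed by a low surrogate is one codepoint.
        if (0xD800 <= u <= 0xDBFF and i + 1 < len(units)
                and 0xDC00 <= units[i + 1] <= 0xDFFF):
            cp = 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            step = 2
        else:
            cp = u
            step = 1
        count += 1
        if count == n:
            return cp
        i += step
    raise IndexError("codepoint index out of range")


def codeunit_at(units: list[int], n: int) -> int:
    """Return code unit n (1-based): a plain array lookup, no scan."""
    if not 1 <= n <= len(units):
        raise IndexError("codeunit index out of range")
    return units[n - 1]
```

For the codeunits 0x61, 0xD83D, 0xDE03, 0x62, codepoint 2 is 0x1F603
and codepoint 3 is 0x62, but finding either requires stepping over
what came before; codeunit 3 (0xDE03) is read directly. Character
access needs a further pass on top of this to group codepoints into
graphemes.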
Internally, the engine will keep things which can be represented in the
platform's native encoding as native as much as possible (the native
encodings have the property that 1 character = 1 codepoint = 1
codeunit); otherwise it will (currently) store things internally as
sequences of codeunits in the UTF-16 encoding. (How this is done may
well change in future in order to permit optimization. For example,
pure Greek or Russian text currently has a penalty compared to English
text as it will always require UTF-16 internal encoding; however, with
the advent of Emoji and other such things, pure English text itself is
becoming much less common.)
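The 'native where possible, UTF-16 otherwise' idea can be modelled
very loosely in Python. This is a sketch only: it assumes the native
encoding is Latin-1 (the real native encoding depends on the platform),
and choose_storage is an invented name, not an engine function:

```python
def choose_storage(s: str, native: str = "latin-1") -> tuple[str, bytes]:
    """Pick a one-byte native form if every codepoint fits, else UTF-16."""
    try:
        # Assumption: one byte per character in the native encoding.
        return ("native", s.encode(native))
    except UnicodeEncodeError:
        # Anything the native encoding cannot express falls back to UTF-16.
        return ("utf-16", s.encode("utf-16-le"))


for sample in ("hello", "\u041f\u0440\u0438\u0432\u0435\u0442"):
    kind, data = choose_storage(sample)
    print(kind, len(data))
```

Under that assumption 'hello' stays native at 5 bytes, while the
six-letter Russian word needs 12 bytes of UTF-16, which is the penalty
described above.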
Most of the time 'character' is the most appropriate thing to use for
reading strings, whilst codepoints can be used to build up strings of
characters.
The presence of 'codeunit' chunks is to allow optimization of critical
routines in script as you can be sure that getting 'codeunit X of
tString' is an array lookup (i.e. one step of computer processing, no
loop needed).
Warmest Regards,
Mark.
--
Mark Waddingham ~ m...@livecode.com ~ http://www.livecode.com/
LiveCode: Everyone can create apps