Re: First 1000 characters without loop?

Mark Waddingham via use-livecode Fri, 23 Jun 2017 01:19:13 -0700

On 2017-06-22 23:18, Richard Gaskin via use-livecode wrote:

With many chunk expressions, I would imagine it does.  With line
chunks, for example, the engine needs to walk through the string,
comparing each character to CR, counting the found CRs as it goes.

Yes - essentially that is the case (although technically it looks for LF, not CR, as currently - for better or for worse - the engine assumes line means LF as the separator, and normalizes line endings appropriately on a per-platform basis when you 'import' things as text into LiveCode).
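To make the "walk and count" idea concrete, here is a minimal sketch in Python (not LiveCode, and not the engine's actual code). The function name line_chunk is hypothetical; it only illustrates why fetching line N means scanning every character before it for LF separators.

```python
def line_chunk(text: str, n: int) -> str:
    """Return line n (1-based) of text by scanning for LF separators.

    Illustrative only: the cost grows with the position of the line,
    because every earlier character has to be compared against LF.
    """
    start = 0      # index where the current line begins
    current = 1    # number of the line we are currently inside
    for i, ch in enumerate(text):
        if ch == "\n":
            if current == n:
                return text[start:i]
            current += 1
            start = i + 1
    # The final line has no trailing LF to terminate it.
    return text[start:] if current == n else ""
```

For example, line_chunk("a\nbb\nccc", 2) gives "bb", but only after the scan has passed over the whole of the first line.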

In this case, though, I believe it doesn't need a loop per se, since
AFAIK characters are fixed-size entities internally (Mark Waddingham,
is it true that UTF-16 gives us two bytes per char across the
board?).

No, this is not quite true - characters are not fixed-size entities from the computer's point of view. In LiveCode 'character' means 'grapheme' - which is roughly what humans consider to be characters in terms of writing and editing.


Indeed, there are several concepts here:

1) character: a character is a sequence of one or more Unicode codepoints

2) codepoint: a codepoint is the index into the Unicode code table (which has space for 1 million or so definitions)

3) codeunit: a codeunit is an index into the Basic Multilingual Plane (BMP) - the first 65536 Unicode codes. The BMP contains a block of codes called 'surrogates' which aren't actually codes in themselves, but allow two codeunits to be used to express a codepoint for any code defined above 65535.


Some examples:

Character 'a':

This is (as you might expect) always a single codepoint, and, indeed, always a single codeunit (in Unicode 'a' is encoded with the same code as it is in ASCII).


Character 'a-acute':

This can either be represented as a single codepoint (and codeunit) 'a-acute' (the same code as a-acute has in the ISO-8859-1 encoding, a strict superset of ASCII).

Or it can be represented as two codepoints 'a', 'combining-acute'. In both cases, these codepoints are in the BMP, so each codepoint is represented as a single codeunit.


Character 'smiling face with open mouth emoji':

This has code 0x1F603 - meaning it falls outside of the BMP (it is > 65535). It is a single codepoint, but requires two codeunits to encode.
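The three examples above can be checked with a short Python sketch (Python rather than LiveCode, purely as an illustration). A Python str is indexed by codepoint, so len() counts codepoints, and encoding to UTF-16 and halving the byte length counts 16-bit codeunits. The helper names are hypothetical.

```python
def codepoint_count(s: str) -> int:
    # Python strings are sequences of codepoints.
    return len(s)

def utf16_codeunit_count(s: str) -> int:
    # Each UTF-16 codeunit is 2 bytes; "utf-16-le" adds no byte-order mark.
    return len(s.encode("utf-16-le")) // 2

samples = {
    "a": "a",                                  # 1 codepoint, 1 codeunit
    "a-acute (single code)": "\u00e1",         # 1 codepoint, 1 codeunit
    "a-acute (a + combining)": "a\u0301",      # 2 codepoints, 2 codeunits
    "smiling face emoji": "\U0001F603",        # 1 codepoint, 2 codeunits
}

for name, s in samples.items():
    print(name, codepoint_count(s), utf16_codeunit_count(s))
```

All four samples are a single 'character' to a human reader, yet the codepoint and codeunit counts differ from one to the next.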


Some comparisons:

ASCII, ISO-8859-1 (Latin-1) and MacRoman are all 'single-codepoint' encodings - all characters which those encodings can express are encoded as a single codepoint.

Unicode is a 'multi-code' encoding - characters may require any number of codepoints to express. For example:

- In Indic languages (which have a somewhat different structure than languages like English, French, German etc.), many codepoints are often needed to represent what humans might consider a 'character'.

- You can stack any number of defined 'combining accents' onto a base character. You can have a character such as a-acute-underbar-ring-grave-cedilla-umlaut if you want.

- Emoji codepoints can be followed by 'modifiers' and 'variation selectors' which allow customization of things like skin tone (modifiers) and text versus emoji presentation (variation selectors).
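The stacked-accent case can be sketched in Python (again only as an illustration, not LiveCode). The function count_base_codepoints is a hypothetical and deliberately rough stand-in for real grapheme segmentation: it counts only codepoints whose Unicode combining class is 0, so it handles stacked accents but not emoji sequences, Indic conjuncts or other cases that full grapheme rules cover.

```python
import unicodedata

def count_base_codepoints(s: str) -> int:
    """Rough approximation of a 'character' count.

    Counts codepoints that are not combining marks (combining class 0).
    This is NOT full grapheme segmentation; it is an assumption-laden
    simplification for the stacked-accent example only.
    """
    return sum(1 for ch in s if unicodedata.combining(ch) == 0)

# a-acute-underbar-ring-grave-cedilla-umlaut, built from combining marks
marks = [
    "\u0301",  # combining acute accent
    "\u0332",  # combining low line (underbar)
    "\u030a",  # combining ring above
    "\u0300",  # combining grave accent
    "\u0327",  # combining cedilla
    "\u0308",  # combining diaeresis (umlaut)
]
stacked = "a" + "".join(marks)

print(len(stacked))                    # 7 codepoints
print(count_base_codepoints(stacked))  # 1 base codepoint
```

Seven codepoints go in, but a reader sees one (heavily decorated) character, which is why a 'character' chunk cannot be located by simple arithmetic on codepoint positions.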

Basically, Unicode is a model for encoding writing systems with the aim that (over time) it can be used to represent *any* writing system which exists now or existed in the past. In order to do this in a tractable way (i.e. a way which could be implemented maintainably on modern systems) it uses an abstract model (sequences of codepoints which form characters). Due to this it can sometimes seem a little 'odd', but then it is trying to model things which were not designed to necessarily fit into a computer's viewpoint of the world - writing systems have evolved organically without thought on how a computer might need to process them.

In terms of LiveCode, you have access to 'character', 'codepoint' and 'codeunit' chunks. In general:

- character access for general strings is never constant time, as characters can require multiple codepoints.

- codepoint access for general strings is never constant time, as codepoints can require two codeunits to encode.

- codeunit access for general strings is always constant time.
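The difference between the last two cases can be sketched in Python over a list of UTF-16 codeunits (an illustration under assumptions, not the engine's implementation; all function names are hypothetical). Fetching a codeunit is a plain index, whereas fetching codepoint N has to walk from the start, because any earlier surrogate pair shifts the position.

```python
def to_utf16_units(s: str) -> list[int]:
    """Encode s as a list of 16-bit UTF-16 codeunits."""
    data = s.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]

def codeunit_at(units: list[int], index: int) -> int:
    """1-based codeunit access: a single array lookup."""
    return units[index - 1]

def codepoint_at(units: list[int], index: int) -> int:
    """1-based codepoint access: must scan, pairing surrogates as it goes."""
    i = 0
    current = 0
    while i < len(units):
        unit = units[i]
        is_pair = (
            0xD800 <= unit <= 0xDBFF
            and i + 1 < len(units)
            and 0xDC00 <= units[i + 1] <= 0xDFFF
        )
        current += 1
        if current == index:
            if is_pair:
                # Combine high and low surrogate into one codepoint.
                return 0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            return unit
        i += 2 if is_pair else 1
    raise IndexError("codepoint index out of range")

units = to_utf16_units("a\U0001F603b")
print([hex(u) for u in units])        # ['0x61', '0xd83d', '0xde03', '0x62']
print(hex(codeunit_at(units, 2)))     # 0xd83d (half of the emoji)
print(hex(codepoint_at(units, 2)))    # 0x1f603 (the whole emoji)
```

Note that codeunit 2 is only the high surrogate of the emoji, which is the price of constant-time access: a codeunit is not guaranteed to be a complete codepoint, let alone a complete character.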

Internally, the engine will keep things which can be represented in the platform's native encoding as native as much as possible (the native encodings have the property that 1 character = 1 codepoint = 1 codeunit); otherwise it will (currently) store things internally as sequences of codeunits in the UTF-16 encoding. (How this might be done in future may well change in order to permit optimization; for example, pure Greek or Russian text currently has a penalty compared to English text as it will always require UTF-16 internal encoding. However, with the advent of Emoji and other such things, pure English text itself is becoming much less common.)
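The 'native if possible, otherwise UTF-16' decision can be sketched in Python as follows. This is only a guess at the shape of the idea, not the engine's code: the function choose_storage is hypothetical, and "latin-1" is used as an assumed stand-in for whatever the platform's native encoding actually is.

```python
def choose_storage(text: str, native_encoding: str = "latin-1") -> tuple[str, bytes]:
    """Return (encoding used, stored bytes) for text.

    Assumption: try the single-byte native encoding first and fall back
    to UTF-16 codeunits only when some character cannot be expressed.
    """
    try:
        return native_encoding, text.encode(native_encoding)
    except UnicodeEncodeError:
        return "utf-16-le", text.encode("utf-16-le")

print(choose_storage("hello"))        # ('latin-1', b'hello')
print(choose_storage("Привет")[0])    # utf-16-le
```

In this sketch English text costs one byte per character while the Russian word costs two, which mirrors the penalty described above for pure Greek or Russian text.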

Most of the time 'character' is the most appropriate thing to use for reading strings, whilst codepoints can be used to build up strings of characters.

The presence of 'codeunit' chunks is to allow optimization of critical routines in script, as you can be sure that getting 'codeunit X of tString' is an array lookup (i.e. one step of computer processing, no loop needed).


Warmest Regards,

Mark.

--
Mark Waddingham ~ m...@livecode.com ~ http://www.livecode.com/
LiveCode: Everyone can create apps

