Re: Text encoding: summary of results and times.

Ben Rubinstein via use-livecode Wed, 08 Sep 2021 02:44:06 -0700


On 07/09/2021 17:22, Bob Sneidar via use-livecode wrote:

This makes sense to me (I think) because if I am not mistaken, UTF16 is 
Unicode, and UTF8 is simple ASCII. The slowdown from 6.7 to 7.0 was precisely 
the support for Unicode text. Someone will correct me if I am wrong about this. 
As a hobbyist, I try and stay away from localization issues. But I am 
interested in the idea that all text incoming should be text decoded and 
outgoing the inverse. (Did I get that right??)

Cue scenes of strong men reeling back in horror, ladies fainting, etc (Bateman cartoons, for those of a British persuasion).

UTF16 is not Unicode, UTF8 is not simple ASCII, and I'm not even sure that the slowdown from 6.7 to 7.0 was precisely the support for Unicode text, though I'm not sure about that.

Unicode and ASCII are both conventions that assign character interpretations to numbers. ASCII assigned approximately 94 character interpretations to the numbers 32-126 (plus a few control interpretations to some other numbers). WindowsLatin1, MacRoman, ISO-8859-1 etc all did the same but to a wider range of numbers up to 255. Unicode does the same thing for a... much... larger number of characters and glyphs, and hence using a... much... larger range of numbers.
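
To make the "numbers, not characters" point concrete, here is a minimal sketch in Python (not LiveCode; the helper names code_point and decode_byte are made up for illustration). It assumes Python's built-in "latin-1" and "mac_roman" codecs as stand-ins for ISO-8859-1 and MacRoman.

```python
# Character sets map numbers to characters; ord() and chr() expose that mapping.

def code_point(ch: str) -> int:
    """Return the Unicode number (code point) assigned to a single character."""
    return ord(ch)


def decode_byte(value: int, encoding: str) -> str:
    """Interpret one byte value (0-255) under a given single-byte character set."""
    return bytes([value]).decode(encoding)


# The numbers 32-126 cover 95 positions: the space plus 94 visible characters.
ascii_printable = [chr(n) for n in range(32, 127)]

if __name__ == "__main__":
    print(code_point("A"))                   # within the ASCII range
    print(hex(code_point("€")))              # well beyond 255
    # The same number means different things in different legacy character sets.
    print(decode_byte(0xE9, "latin-1"))
    print(decode_byte(0xE9, "mac_roman"))
```

The last two lines print different characters for the same byte value, which is exactly why a stream written on one platform cannot be read on another without knowing which convention was used.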

Unicode specifies numbers, not bytes. UTF8 and UTF16 are two of several ways of representing Unicode strings in bytes. UTF8 is designed to do so in a way that makes ASCII text compatible with UTF8, i.e. a file of ASCII text is a valid UTF8 file; the reverse is not necessarily true.
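
A rough illustration of the difference between the numbers and their byte representations, again as a Python sketch rather than LiveCode. The helper names encoded_lengths and is_valid_ascii are hypothetical, and "utf-16-le" is chosen only to keep the byte order mark out of the byte counts.

```python
# One Unicode string, several byte representations.

def encoded_lengths(text: str) -> dict:
    """Return the number of bytes needed to store text in UTF-8 and UTF-16."""
    return {enc: len(text.encode(enc)) for enc in ("utf-8", "utf-16-le")}


def is_valid_ascii(data: bytes) -> bool:
    """Return True if every byte in data is a valid ASCII value (0-127)."""
    try:
        data.decode("ascii")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    for sample in ("A", "é", "€", "😀"):
        print(repr(sample), encoded_lengths(sample))

    # ASCII bytes are already valid UTF-8 and decode to the same text.
    print(b"plain ASCII".decode("utf-8"))

    # The reverse does not hold: UTF-8 for non-ASCII text is not valid ASCII.
    print(is_valid_ascii("héllo".encode("utf-8")))
```

Neither encoding is "Unicode" itself: both carry the same code points, and they differ only in how many bytes each number costs.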

A long-running problem with Metacard, Revolution, LC up to v6 was being surprisingly platform-centric about character sets. To this day, textEncode etc only support MacRoman on Mac, only support ISO-8859-1 on Linux, and so on; as if we never are on one platform, needing to deal with character streams generated on another. See

https://quality.livecode.com/show_bug.cgi?id=12205
https://quality.livecode.com/show_bug.cgi?id=22391
https://quality.livecode.com/show_bug.cgi?id=21320

LC7 brought LiveCode into the later part of the 20th century by properly supporting Unicode, and by breaking the assumed link between bytes and characters. However if I understand correctly, the internal format of strings does not, or at least not necessarily, correspond to any externally defined standard, but can take various forms in order to maximise efficiencies of processing and storage.


Not sure if this helps, but it helped me to write it!

Ben

_______________________________________________
use-livecode mailing list
use-livecode@lists.runrev.com
Please visit this url to subscribe, unsubscribe and manage your subscription 
preferences:
http://lists.runrev.com/mailman/listinfo/use-livecode
