On 2018-11-13 12:43, Ben Rubinstein via use-livecode wrote:
I'm grateful for all the information, but _outraged_ that the thread
that I carefully created separate from the offset thread was so
quickly hijacked for the continuing (useful!) detailed discussion on
that topic.
The phrase 'attempting to herd cats' springs to mind ;)
From recent contributions on both threads I'm getting some more
insights, but I'd really like to understand clearly what's going on. I
do think that I should have asked this question more broadly: how does
the engine represent values internally?
The engine uses a number of distinct types 'behind the scenes'. The ones
pertinent to LCS (there are many many more which LCS never sees) are:
- nothing: a type with a single value (nothing/null)
- boolean: a type with two values true/false
- number: a type which can either store a 32-bit integer *or* a double
- string: a type which can either store a sequence of native (single byte)
codes, or a sequence of unicode (two byte - UTF-16) codes
- name: a type which stores a string, but uniques the string so that
caseless and exact equality checking is constant time
- data: a type which stores a sequence of bytes
- array: a type which stores (using a hashtable) a mapping from 'names' to
any other storage value type
The LCS part of the engine then sits on top of these core types, providing
various conversions depending on context.
All LCS syntax is actually typed - meaning that when you pass a value to any
piece of LCS syntax, each argument is converted to the type required.
e.g. nativeCharToNum() has signature 'integer nativeCharToNum(string)', meaning that it
expects a string as input and will return a number as output.
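So if you pass something other than a string, it is converted to a string first. A quick
sketch (the results assume a native single-byte encoding in which "7" has code 55, which
holds on all the desktop platforms):

  put nativeCharToNum("7")  -- 55: the code of the character "7"
  put nativeCharToNum(7)    -- also 55: the number 7 is first converted to the string "7"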
Some syntax is overloaded - meaning that it can act in slightly different (but
always consistent) ways depending on the type of the arguments.
e.g. & has signatures 'string &(string, string)' and 'data &(data, data)'.
In simple cases where there is no overload, type conversion occurs exactly as
required:
e.g. nativeCharToNum() has no overload, so it always expects a string - which means that
the input argument will always undergo a 'convert to string' operation.
The convert to string operation operates as follows:
- nothing -> ""
- boolean -> "true" or "false"
- number -> decimal representation of the number, using numberFormat
- string -> stays the same
- name -> uses the string the name contains
- data -> converts to a string using the native encoding
- array -> converts to empty (a very old semantic which probably does more
harm than good!)
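You can see most of these rules in action just by concatenating values - a rough sketch
(assuming the default numberFormat):

  put "answer: " & (1 = 1)  -- "answer: true" (boolean -> "true")
  put "third: " & (1 / 3)   -- "third: 0.333333" (number -> decimal string, via numberFormat)
  set the numberFormat to "0.00"
  put "third: " & (1 / 3)   -- "third: 0.33" (numberFormat changes the conversion)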
In cases where syntax is overloaded, type conversion generally happens in a
syntax-specific sequence in order to preserve consistency:
e.g. In the case of &, it can either take two data arguments, or two string arguments. If
both arguments are data, then the result will be data; otherwise both arguments will be
converted to strings, and a string returned.
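To illustrate (a small sketch - textEncode is only used here to manufacture data values to
play with):

  put textEncode("ab", "UTF-8") into tLeft          -- a data value
  put textEncode("cd", "UTF-8") into tRight         -- another data value
  put (tLeft & tRight) is strictly a binary string  -- true: data & data -> data
  put (tLeft & "cd") is strictly a string           -- true: mixed -> both sides become strings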
From Monte I get that the internal encoding for 'string' may be
MacRoman, ISO 8859 (I thought it would be CP1252), or UTF16 -
presumably with some attribute to tell the engine which one in each
case.
Monte wasn't quite correct - on Mac it is MacRoman or UTF-16, on Windows it is CP1252 or
UTF-16, on Linux it is ISO-8859-1 or UTF-16. There is an internal flag in a string value
which says whether its character sequence is single-byte (native) or double-byte (UTF-16).
So then my question is whether a 'binary string' is a pure blob, with
no clues as to interpretation; or whether in fact it does have some
attributes to suggest that it might be interpreted as UTF8, UTF16
etc?
Data (binary string) values are pure blobs - they are just sequences of bytes - a data
value carries no knowledge of where it came from. Indeed, that would generally be a bad
idea as you wouldn't get repeatable semantics (i.e. a data value produced by one codepath
might have a different effect in context from an identical one fetched from somewhere
else).
That being said, the engine does store some flags on values - but purely for optimization,
i.e. to save later work. For example, a string value can cache its (double) numeric value -
which saves multiple 'convert to number' operations being performed on the same
(pointer-wise) string (due to the copy-on-write nature of values, and the fact that all
literals are unique names, pointer-wise equality of values occurs a great deal).
If there are no such attributes, how does codepointOffset operate when
passed a binary string?
CodepointOffset has signature 'integer codepointOffset(string)', so when you pass a binary
string (data) value to it, the data value gets converted to a string by interpreting it as
a sequence of bytes in the native encoding.
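Which is why, if the data actually contains (say) UTF-8 encoded text, you need to decode it
first - a rough sketch:

  put textEncode("héllo", "UTF-8") into tData           -- data: the UTF-8 bytes of "héllo"
  put codepointOffset("é", tData)                       -- 0: the bytes are read as native chars, not "é"
  put codepointOffset("é", textDecode(tData, "UTF-8"))  -- 2: decode first, then search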
If there are such attributes, how do they get set? Evidently if
textEncode is used, the engine knows that the resulting value is the
requested encoding. But what happens if the program reads a file as
'binary' - presumably the result is a binary string, how does the
engine treat it?
There are no attributes of that ilk. When you read a file as binary you get data (binary
string) values - which means when you pass them to string-taking functions/commands that
data gets interpreted as a sequence of bytes in the native encoding. This is why you must
always explicitly textEncode/textDecode data values when you know they are not representing
native encoded text.
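e.g. Something like this (tPath here is just a placeholder for wherever your file lives):

  put url ("binfile:" & tPath) into tData                       -- data: raw bytes, no encoding attached
  put textDecode(tData, "UTF-8") into tText                     -- a string, if you know the file was UTF-8
  put textEncode(tText, "UTF-8") into url ("binfile:" & tPath)  -- encode explicitly when writing back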
Is there any way at LiveCode script level to detect what a value is,
in the above terms?
Yes - the 'is strictly' operators:
is strictly nothing
is strictly a boolean
is strictly an integer - a number which has internal rep 32-bit int
is strictly a real - a number which has internal rep double
is strictly a string
is strictly a binary string
is strictly an array
It should be noted that 'is strictly' reports only how that value is stored
and not anything based on the value itself. This only really applies to 'an
integer' and 'a real' - you can store an integer in a double and all LCS
arithmetic operators act on doubles.
e.g. (1+2) is strictly an integer -> false
(1+2) is strictly a real -> true
In contrast, though, *some* syntax will return numbers which are stored
internally as integers:
e.g. nativeCharToNum("a") is strictly an integer -> true
I should point out that what 'is strictly' operators return for any given
context is not stable in the sense that future engine versions might return
different things. e.g. We might optimize arithmetic in the future (if we can
figure out a way to do it without performance penalty!) so that things which
are definitely integers, are stored as integers (e.g. 1 + 2 in the above).
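Putting a few of those together (results as of current engines - per the caveat above, they
could change in future):

  put (1 = 1) is strictly a boolean                           -- true: comparisons yield boolean values
  put (1 + 2) is strictly a real                              -- true: arithmetic currently works on doubles
  put nativeCharToNum("a") is strictly an integer             -- true: this syntax returns a 32-bit int
  put textEncode("abc", "UTF-8") is strictly a binary string  -- true: textEncode returns data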
And one more question: if a string, or binary string, is saved in a
'binary' file, are the bytes stored on disk a faithful rendition of
the bytes that composed the value in memory, or an interpretation of
some kind?
What happens when you read or write data or string values to a file depends on
how you opened the file.
If you opened the file for binary (whether for reading or writing), then when you read you
will get data, and when you write, string values will be converted to data via the native
encoding (the default rule).
If you opened the file for text, then the engine will try and determine (using
a BOM) the existing text encoding of the file. If it can't determine it (if
for example, you are opening a file for write which doesn't exist), it will
assume it is encoded as native.
Otherwise the file will have an explicit encoding associated with it specified
by you - reading from it will interpret the bytes in that explicit encoding;
while writing to it will expect string values which will be encoded
appropriately. In the latter case if you write data values, they will first be
converted to a string (assuming native encoding) and then written as strings
in the file's encoding (i.e. default type conversion applies).
Essentially you can view files as typed streams - if you opened a file for binary, reads
give and writes take data; if you opened it for text, reads give and writes take strings,
and the default type conversion rules apply.
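i.e. Roughly (tPath is again just a placeholder, and the explicit-encoding form of 'open
file' is from memory so do check the dictionary):

  open file tPath for binary read
  read from file tPath until EOF       -- gives data: the raw bytes
  close file tPath

  open file tPath for utf-8 text read  -- text mode with an explicit encoding
  read from file tPath until EOF       -- gives a string, decoded from UTF-8
  close file tPath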
Warmest Regards,
Mark.