Hi Monte,

Thanks for this, and sorry for the delayed reply - I've been away.

>> Does textEncode _always_ return a binary string? Or, if invoked with "CP1252", "ISO-8859-1", "MacRoman" or "Native", does it return a string?
>
> Internally we have different types of values. So we have MCStringRef which is the thing which either contains a buffer of native chars or a buffer of UTF-16 chars. There are others.
...
> The return type of textEncode is an MCDataRef. This is a byte buffer, buffer size & byte count.
>
> So:
> put textEncode("foo", "UTF-8") into tFoo # tFoo holds MCDataRef
>
> Then if we do something like:
> set the text of field "foo" to tFoo
>
> tFoo is first converted to MCStringRef. As it’s an MCDataRef we just move the buffer over and say it’s a native encoded string. There’s no checking to see if it’s a UTF-8 string and decoding with that etc.

So my question would be: is this helpful? If, given any MCDataRef (i.e. 'binary string'), LC makes the assumption - whenever it needs an MCStringRef - that the binary string is 'native', then I would think it will be wrong more often than it is correct!
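
To make that concrete - a sketch, on a Windows/CP1252 system (the field name is just for illustration):

put textEncode("é", "UTF-8") into tData # two bytes: 0xC3, 0xA9
set the text of field "demo" to tData # implicit 'native' decode: the field shows "Ã©"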

IIUC, the chief ways to obtain an MCDataRef are by reading a file in binary mode, or by calling textEncode (or loading a non-file URL???). Insofar as one could make an assumption at all, my guess is that in the first case the data is more likely to be UTF-8; and whatever is most likely in the second case, 'native' is about the least likely. (If the assumption were UTF-16 it would at least make more sense.)

Would it not be better to refuse to make an assumption, i.e. to require an explicit conversion? If you want to proceed on the assumption that a file is 'native' text, read it as text; if you know what it is, read it as binary and use textDecode. If you used textEncode anyway (or numToByte) then obviously you know what the data is, and when you want to make a string out of it you can tell LC how to interpret it. Wouldn't it be better to throw an error when an MCDataRef is passed where an MCStringRef is required, than to introduce subtle errors by making (in my opinion implausible) assumptions?
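
In code, the explicit version I'm arguing for would look like this (a sketch, assuming the file at tPath really is UTF-8):

put URL ("binfile:" & tPath) into tData # raw bytes, no interpretation
put textDecode(tData, "UTF-8") into tText # I tell LC what the bytes are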

And now that the thought has occurred to me - when a URL with a non-file protocol is used as the source of a value, what is the type of the value - MCStringRef or MCDataRef?

thanks for the continuing education!

Ben

On 13/11/2018 23:44, Monte Goulding via use-livecode wrote:


On 14 Nov 2018, at 6:33 am, Ben Rubinstein via use-livecode 
<use-livecode@lists.runrev.com> wrote:

That's really helpful - and in parts eye-opening - thanks Mark.

I have a few follow-up questions.

Does textEncode _always_ return a binary string? Or, if invoked with "CP1252", "ISO-8859-1", 
"MacRoman" or "Native", does it return a string?

Internally we have different types of values. So we have MCStringRef which is 
the thing which either contains a buffer of native chars or a buffer of UTF-16 
chars. There are others. For example, MCNumberRef will either hold a 32 bit 
signed int or a double. These are returned by numeric operations where there’s 
no string representation of a number. So:

put 1.0 into tNumber # tNumber holds an MCStringRef
put 1.0 + 0 into tNumber # tNumber holds an MCNumberRef
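
(A sketch: if I recall correctly, the "is strictly" operators added in LC 8 let you observe this from script, since they test the internal type without converting:)

put 1.0 into tNumber
put tNumber is strictly a string # true: the literal is still an MCStringRef
put 1.0 + 0 into tNumber
put tNumber is strictly a string # false: the arithmetic produced an MCNumberRef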

The return type of textEncode is an MCDataRef. This is a byte buffer, buffer size 
& byte count.

So:
put textEncode("foo", "UTF-8") into tFoo # tFoo holds MCDataRef

Then if we do something like:
set the text of field "foo" to tFoo

tFoo is first converted to MCStringRef. As it’s an MCDataRef we just move the 
buffer over and say it’s a native encoded string. There’s no checking to see if 
it’s a UTF-8 string and decoding with that etc.

Then the string is put into the field.

If you remember that mergJSON issue you reported - where mergJSON returns UTF-8 
data, and putting it into a field made it look funny - this is why.
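
The fix, then, is an explicit decode before the field ever sees the data - a sketch:

# tJSON holds UTF-8 bytes (e.g. as returned by mergJSON)
set the text of field "foo" to textDecode(tJSON, "UTF-8")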

CodepointOffset has signature 'integer codepointOffset(string)', so when you
pass a binary string (data) value to it, the data value gets converted to a
string by interpreting it as a sequence of bytes in the native encoding.

OK - so one message I take is that in fact one should never invoke 
codepointOffset on a binary string. Should it actually throw an error in this 
case?

No, as mentioned above values can move to and from different types according to 
the operations performed on them, and this is largely opaque to the scripter. If 
you do a text operation on a binary string then there's an implicit conversion 
to a native encoded string. In 7+ you generally want to use codepoint where 
previously you used char, unless you know you are dealing with a binary string, 
in which case use byte.
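
For example, a sketch of the difference:

put textEncode("hé", "UTF-8") into tData
put the number of bytes in tData # 3: "h" plus the two-byte "é"
put textDecode(tData, "UTF-8") into tString
put the number of codepoints in tString # 2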

By the same token, one should probably only use 'byte', 'byteOffset', 
'byteToNum' etc with binary strings - would it be better, to avoid confusion, 
if char, offset, charToNum refused to operate on a binary string?

That would not be backwards compatible.

e.g. In the case of &, it can either take two data arguments, or two
string arguments. In this case, if both arguments are data, then the result
will be data. Otherwise both arguments will be converted to strings, and a
string returned.
The second message I take is that one needs to be very careful, if operating on 
UTF-8 or other binary strings, to avoid 'contaminating' them, e.g. by 
concatenating with a simple quoted string, as this may cause them to be silently 
converted to a non-binary string. (I presume that 'put "simple string" 
after/before pBinaryString' will cause a conversion in the same way as "&"? 
What about 'put "!" into char x of pBinaryString'?)

When concatenating, if both left and right are binary strings (MCDataRef) then 
there's no conversion of either to string; however, we do not currently have a 
way to declare a literal as a binary string (might be nice if we did!) so you 
would need to:

put textEncode("simple string”, “UTF-8”) after pBinaryString


The engine can tell whether a string is 'native' or UTF16. When the engine is 
converting a binary string to 'string', does it always interpret the source as 
the native 8-bit encoding, or does it have some heuristic to decide whether it 
would be more plausible to interpret the source as UTF16?

No, it does not try to interpret. ICU has a charset detector that will give you 
a list of possible charsets along with a confidence. It could be implemented as 
a separate API:

get detectedTextEncodings(<binary string>, [<optional hint charset>]) -> array of charset/confidence pairs

get bestDetectedTextEncoding(<binary string>, [<optional hint charset>]) -> charset
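
Usage might look something like this (entirely hypothetical, of course - neither function exists yet):

put URL ("binfile:" & tPath) into tData
put bestDetectedTextEncoding(tData) into tCharset # hypothetical, per the proposal above
put textDecode(tData, tCharset) into tText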

Feel free to feature request that!

Cheers

Monte


_______________________________________________
use-livecode mailing list
use-livecode@lists.runrev.com
Please visit this url to subscribe, unsubscribe and manage your subscription 
preferences:
http://lists.runrev.com/mailman/listinfo/use-livecode

