Hi Monte,
Thanks for this, and sorry for the delayed reply - I've been away.
>> Does textEncode _always_ return a binary string? Or, if invoked with
>> "CP1252", "ISO-8859-1", "MacRoman" or "Native", does it return a string?
>
> Internally we have different types of values. So we have MCStringRef which
> is the thing which either contains a buffer of native chars or a buffer of
> UTF-16 chars. There are others.
...
> The return type of textEncode is an MCDataRef. This is a byte buffer,
> buffer size & byte count.
>
> So:
> put textEncode("foo", "UTF-8") into tFoo # tFoo holds MCDataRef
>
> Then if we do something like:
> set the text of field "foo" to tFoo
>
> tFoo is first converted to MCStringRef. As it’s an MCDataRef we just move
> the buffer over and say it’s a native encoded string. There’s no checking to
> see if it’s a UTF-8 string and decoding with that etc.
So my question would be, is this helpful? If, given any MCDataRef (i.e.
'binary string'), LC makes the assumption - when it needs an MCStringRef - that
the binary string is 'native', then I would think it will be wrong more often
than it is correct!
IIUC, the chief ways to obtain an MCDataRef are by reading a file in binary
mode, or by calling textEncode (or loading a non-file URL???). Insofar as one
could make an assumption at all, my guess is that in the first case the data
is more likely to be UTF8; and whatever is most likely in the second case,
'native' is about the least likely. (If the assumption was UTF16 it would at
least make more sense.)
Would it not be better to refuse to make an assumption, i.e. require an
explicit conversion? If you want to proceed on the assumption that a file is
'native' text, read it as text; if you know what it is, read it as binary and
use textDecode. If you used textEncode anyway (or numToByte) then obviously
you know what it is, and when you want to make a string out of it you can tell
LC how to interpret it. Wouldn't it be better to throw an error when an
MCDataRef is passed where an MCStringRef is required, rather than introduce
subtle errors by making (in my opinion implausible) assumptions?
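To make the explicit route concrete, here is a minimal sketch, assuming LiveCode 7 or later. The function name readTextFile and its parameters are made up for illustration; only URL "binfile:", the result and textDecode are engine features.

```livecode
function readTextFile pPath, pEncoding   # hypothetical helper, not an engine function
   local tData
   put URL ("binfile:" & pPath) into tData   # raw bytes, no text conversion on read
   if the result is not empty then return empty   # the read failed
   return textDecode(tData, pEncoding)   # the caller states the encoding explicitly
end readTextFile
```

It would be called as, for example, put readTextFile(tPath, "UTF-8") into field "foo", so the encoding is always named at the point where bytes become text.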
And now that the thought has occurred to me - when a URL with a non-file
protocol is used as the source of a value, what is the type of the value -
MCStringRef or MCDataRef?
thanks for the continuing education!
Ben
On 13/11/2018 23:44, Monte Goulding via use-livecode wrote:
On 14 Nov 2018, at 6:33 am, Ben Rubinstein via use-livecode
<use-livecode@lists.runrev.com> wrote:
That's really helpful - and in parts eye-opening - thanks Mark.
I have a few follow-up questions.
Does textEncode _always_ return a binary string? Or, if invoked with "CP1252", "ISO-8859-1",
"MacRoman" or "Native", does it return a string?
Internally we have different types of values. So we have MCStringRef which is
the thing which either contains a buffer of native chars or a buffer of UTF-16
chars. There are others. For example, MCNumberRef will either hold a 32 bit
signed int or a double. These are returned by numeric operations where there’s
no string representation of a number. So:
put 1.0 into tNumber # tNumber holds an MCStringRef
put 1.0 + 0 into tNumber # tNumber holds an MCNumberRef
The return type of textEncode is an MCDataRef. This is a byte buffer, buffer size
& byte count.
So:
put textEncode("foo", "UTF-8") into tFoo # tFoo holds MCDataRef
Then if we do something like:
set the text of field "foo" to tFoo
tFoo is first converted to MCStringRef. As it’s an MCDataRef we just move the
buffer over and say it’s a native encoded string. There’s no checking to see if
it’s a UTF-8 string and decoding with that etc.
Then the string is put into the field.
If you remember that mergJSON issue you reported, where mergJSON returns UTF-8
data and you were putting it into a field and it looked funny, this is why.
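A minimal sketch of the behaviour described above, assuming LiveCode 7 or later and a field named "foo" on the current card. The handler name is hypothetical, and exactly which characters the first assignment shows depends on the platform's native encoding.

```livecode
on showImplicitConversion   # hypothetical handler name
   local tData
   put textEncode("café", "UTF-8") into tData   # binary string; the é is stored as two bytes
   # Implicit conversion: every byte is taken as one native character,
   # so the two bytes of the é show up as two unrelated characters.
   set the text of field "foo" to tData
   # Explicit conversion: decode the bytes as UTF-8 before using them as text.
   set the text of field "foo" to textDecode(tData, "UTF-8")
end showImplicitConversion
```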
CodepointOffset has signature 'integer codepointOffset(string)', so when you
pass a binary string (data) value to it, the data value gets converted to a
string by interpreting it as a sequence of bytes in the native encoding.
OK - so one message I take is that one should in fact never invoke
codepointOffset on a binary string. Should it actually throw an error in this
case?
No, as mentioned above values can move to and from different types according to
the operations performed on them and this is largely opaque to the scripter. If
you do a text operation on a binary string then there’s an implicit conversion
to a native encoded string. In 7+ you generally want to use codepoint where
you previously used char, unless you know you are dealing with a binary
string, in which case you use byte.
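As a small illustration of the codepoint/byte distinction, assuming LiveCode 7 or later (the handler name is hypothetical), the same text can be counted both ways:

```livecode
on compareChunkCounts   # hypothetical handler name
   local tText, tData
   put "café" into tText                       # a string
   put textEncode(tText, "UTF-8") into tData   # the same text as a binary string
   # Count codepoints on the string and bytes on the data, then show both
   # in the message box separated by a comma.
   put (the number of codepoints in tText) & comma & (the number of bytes in tData)
end compareChunkCounts
```

If the é in the script is a single precomposed character, the byte count should come out one higher than the codepoint count, because UTF-8 needs two bytes for it.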
By the same token, probably one should only use 'byte', 'byteOffset',
'byteToNum' etc with binary strings - would it be better, to avoid confusion,
if char, offset and charToNum refused to operate on a binary string?
That would not be backwards compatible.
e.g. In the case of &, it can either take two data arguments, or two
string arguments. In this case, if both arguments are data, then the result
will be data. Otherwise both arguments will be converted to strings, and a
string returned.
The second message I take is that one needs to be very careful, if operating on
UTF8 or other binary strings, to avoid 'contaminating' them e.g. by
concatenating with a simple quoted string, as this may cause it to be silently
converted to a non-binary string. (I presume that 'put "simple string"
after/before pBinaryString' will cause a conversion in the same way as "&"?
What about 'put "!" into char x of pBinaryString'?)
When concatenating, if both left and right are binary strings (MCDataRef) then
there’s no conversion of either to string. However, we do not currently have a
way to declare a literal as a binary string (might be nice if we did!) so you
would need to:
put textEncode("simple string", "UTF-8") after pBinaryString
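A slightly fuller sketch of that pattern, assuming LiveCode 7 or later and a field named "foo". The handler name is hypothetical, and the 'is strictly' check is included on the assumption that your engine version provides that operator.

```livecode
on appendAsData   # hypothetical handler name
   local tBinary
   put textEncode("naïve ", "UTF-8") into tBinary
   # Both sides are binary strings, so per the description above
   # neither should be converted to a string.
   put textEncode("simple string", "UTF-8") after tBinary
   # Report in the message box whether the value is still data
   # (assumes the 'is strictly' operator is available).
   put tBinary is strictly a binary string
   # Decode explicitly only at the point where text is needed.
   put textDecode(tBinary, "UTF-8") into field "foo"
end appendAsData
```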
The engine can tell whether a string is 'native' or UTF16. When the engine is
converting a binary string to 'string', does it always interpret the source as
the native 8-bit encoding, or does it have some heuristic to decide whether it
would be more plausible to interpret the source as UTF16?
No, it does not try to interpret. ICU has a charset detector that will give you
a list of possible charsets along with a confidence. It could be implemented as
a separate API:
get detectedTextEncodings(<binary string>, [<optional hint charset>]) -> array of charset/confidence pairs
get bestDetectedTextEncoding(<binary string>, [<optional hint charset>]) -> charset
Feel free to feature request that!
Cheers
Monte
_______________________________________________
use-livecode mailing list
use-livecode@lists.runrev.com
Please visit this url to subscribe, unsubscribe and manage your subscription
preferences:
http://lists.runrev.com/mailman/listinfo/use-livecode