On 2017-06-23 03:19, Richard Gaskin via use-livecode wrote:
> Seems murky.  I'd much rather at least have something like a byteLen
> function, which returns the number of bytes for a given string.  With
> that I can maintain byte offsets into a file with good performance and
> no ambiguity.

You do:

  the number of bytes in textEncode(tString, <encoding>)

The 'number of bytes in a string' makes no sense, as there is no direct relationship between bytes and strings. I appreciate why this idea hangs around - it used to be true - char and byte were the same concept prior to 7.0, but that's only because the concept of 'char' was that of ISO-8859-1/Latin-1, which can only represent the following written languages:

Western Europe and Americas: Afrikaans, Basque, Catalan, Danish, Dutch, English, Faeroese, Finnish, French, Galician, German, Icelandic, Irish, Italian, Norwegian, Portuguese, Spanish and Swedish.

If you stepped outside of that 'area', then it wasn't very much help (see https://www.terena.org/activities/multiling/ml-docs/iso-8859.html for the historical encodings covering different sets of written languages).

The question you have to ask is 'how many bytes are in a string after it has been encoded in <encoding>' - when a string is written to disk an encoding *has* to be chosen. Sometimes the encoding is ASCII, sometimes it is UTF-8, sometimes it is UTF-16, sometimes it is something more exotic.
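
As a minimal sketch of that question in script: the handler name byteLen is taken from the original request and is not an engine function, and the parameter names are my own. It simply wraps the expression shown above.

```livecode
-- Hypothetical helper (not built in): byte length of pString
-- once it has been encoded with pEncoding.
function byteLen pString, pEncoding
   return the number of bytes in textEncode(pString, pEncoding)
end byteLen

on demoByteLen
   local tWord
   -- "caf" followed by e-acute (codepoint 233)
   put "caf" & numToCodepoint(233) into tWord
   put byteLen(tWord, "UTF-8") & return & \
         byteLen(tWord, "ISO-8859-1")
   -- UTF-8 gives 5, ISO-8859-1 gives 4
end demoByteLen
```

The same four-char string has a different byte length depending on the encoding chosen, which is why the encoding has to be part of the question.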

For any file format, the encoding of text always has to be defined - so you always 'know' the encoding if you know the file format (although for some formats the encoding might be indicated by bytes prefixing the encoded string, e.g. a Byte Order Mark, or by a piece of information in the header of the encoded file).
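
For the Byte Order Mark case, a rough sketch of sniffing the prefix bytes might look like the following. The handler name bomEncoding is hypothetical, and it only checks the UTF-8 and UTF-16 marks. It ignores UTF-32, whose little-endian mark starts with the same two bytes as UTF-16LE, so treat the result as a hint rather than a guarantee.

```livecode
-- Hypothetical helper: guess an encoding name from a leading
-- Byte Order Mark, or return empty if none is recognised.
function bomEncoding pData
   local tCount, tB1, tB2, tB3
   put the number of bytes in pData into tCount
   if tCount >= 2 then
      put byteToNum(byte 1 of pData) into tB1
      put byteToNum(byte 2 of pData) into tB2
   end if
   if tCount >= 3 then put byteToNum(byte 3 of pData) into tB3
   -- EF BB BF
   if tB1 is 239 and tB2 is 187 and tB3 is 191 then return "UTF-8"
   -- FE FF
   if tB1 is 254 and tB2 is 255 then return "UTF-16BE"
   -- FF FE (could also be the start of a UTF-32LE mark)
   if tB1 is 255 and tB2 is 254 then return "UTF-16LE"
   return empty
end bomEncoding
```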

> How do I find a substring in binary data in a way that will tell me
> the number of bytes of the offset?

If you have loaded binary data, and want to find the offset of a sequence of bytes within it then use 'byteOffset'.

If your binary data is actually encoded text data, then you need to textEncode the 'needle' (the thing you are searching for) first, making sure you do so with the encoding which the encoded text data requires:

  -- put the encoded/raw data you want to search into tHaystackData
  put textEncode(tNeedleText, <encoding of target data>) into tNeedleData
  put byteOffset(tNeedleData, tHaystackData) into tOffset
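
Wrapped up as a function, those steps could look something like this. The handler and parameter names are my own, and the encoding is passed in because only the caller knows what the data contains.

```livecode
-- Hypothetical helper: exact-match byte offset of pNeedleText
-- within pHaystackData, where the data is text encoded as pEncoding.
-- Returns 0 if the encoded needle does not occur.
function encodedByteOffset pNeedleText, pHaystackData, pEncoding
   local tNeedleData
   put textEncode(pNeedleText, pEncoding) into tNeedleData
   return byteOffset(tNeedleData, pHaystackData)
end encodedByteOffset
```

For example, with tData holding UTF-8 encoded text, encodedByteOffset("needle", tData, "UTF-8") would give the 1-based byte position of the first exact occurrence.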

However, it is important to note that this only allows an exact match - you can't do caseless searches like this (or searches where you want 'e-acute' to match both 'e-acute' and 'e,combining-acute').

If you want to do caseless searches, then you need to do something like this:

   put textDecode(tHaystackData, <encoding of data>) into tHaystackText
   put offset(tNeedleText, tHaystackText) into tNeedleOffset
   put the number of bytes in textEncode(char 1 to tNeedleOffset of tHaystackText, <encoding of data>) into tNeedleByteOffset
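
A function form of the same idea is sketched below, with hypothetical handler and parameter names. It is a variation rather than a transcription: it counts the bytes before the match and adds one, so the result points at the first byte of the match even when the match begins with a multi-byte character. It assumes that re-encoding the decoded prefix reproduces the original bytes, and that the caseSensitive property is at its default of false.

```livecode
-- Hypothetical helper: caseless search for pNeedleText in encoded
-- text data, returning the 1-based byte offset of the match start,
-- or 0 if there is no match.
function caselessByteOffset pNeedleText, pHaystackData, pEncoding
   local tHaystackText, tCharOffset, tPrefixData
   put textDecode(pHaystackData, pEncoding) into tHaystackText
   put offset(pNeedleText, tHaystackText) into tCharOffset
   if tCharOffset is 0 then return 0
   -- encode everything before the match to count its bytes
   put textEncode(char 1 to (tCharOffset - 1) of tHaystackText, \
         pEncoding) into tPrefixData
   return (the number of bytes in tPrefixData) + 1
end caselessByteOffset
```

Decoding the whole haystack costs time and memory on large files, so this is better suited to data that already fits comfortably in memory.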

i.e. the operation you want to perform is 'offset of <needle> in <data> when using encoding <encoding>', which might make a useful engine addition - feel free to file an enhancement request, although the above snippet should work in script with the operations we currently have. (Similarly, your 'byteLen' function is actually 'length of string in encoding <encoding>' - that also might be a useful engine addition, but can also be done in script now, as outlined above.)

Warmest Regards,

Mark.

--
Mark Waddingham ~ m...@livecode.com ~ http://www.livecode.com/
LiveCode: Everyone can create apps