Re: First 1000 characters without loop?

Mark Waddingham via use-livecode Fri, 23 Jun 2017 01:37:36 -0700

On 2017-06-23 03:19, Richard Gaskin via use-livecode wrote:

Seems murky.  I'd much rather at least have something like a byteLen
function, which returns the number of bytes for a given string.  With
that I can maintain byte offsets into a file with good performance and
no ambiguity.


You do:

  the number of bytes in textEncode(tString, <encoding>)

The 'number of bytes in a string' makes no sense, as there is no direct relationship between bytes and strings. I appreciate why this idea hangs around - it used to be true - char and byte were the same concept prior to 7.0, but that's only because the concept of 'char' was that of ISO8859-1/Latin-1, which can only represent the following written languages:

Western Europe and Americas: Afrikaans, Basque, Catalan, Danish, Dutch, English, Faeroese, Finnish, French, Galician, German, Icelandic, Irish, Italian, Norwegian, Portuguese, Spanish and Swedish.

If you step outside of that 'area', then it wasn't very much help (see https://www.terena.org/activities/multiling/ml-docs/iso-8859.html for the historical encodings covering different sets of written languages).

The question you have to ask is 'how many bytes are in a string after it has been encoded in <encoding>' - when a string is written to disk an encoding *has* to be chosen. Sometimes the encoding is ASCII, sometimes it is UTF-8, sometimes it is UTF-16, sometimes it is something more exotic.
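As a minimal sketch of that question in script, the handler below wraps the textEncode expression given earlier. The name byteLen comes from the request quoted above; the parameter names, the mouseUp handler and the sample string are only illustrative.

```livecode
function byteLen pString, pEncoding
   -- Encode the text first, then count the bytes of the resulting binary data.
   return the number of bytes in textEncode(pString, pEncoding)
end function

on mouseUp
   local tText
   -- "caf" followed by e-acute (U+00E9), built by codepoint so the example
   -- does not depend on how the script itself is stored.
   put "caf" & numToCodepoint(233) into tText
   put byteLen(tText, "UTF-8") & comma & byteLen(tText, "ISO-8859-1")
end mouseUp
```

E-acute takes two bytes in UTF-8 and one byte in ISO-8859-1, so the same four-character string should report 5 and 4 respectively.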

For any file format, an encoding of text always has to be defined - so you always 'know' if you know the file format (although for some, the encoding might be indicated by a byte prefixing the encoded string, or as a piece of information in the header of the encoded file - e.g. Byte Order Marks).

How do I find a substring in binary data in a way that will tell me
the number of bytes of the offset?

If you have loaded binary data, and want to find the offset of a sequence of bytes within it, then use 'byteOffset'.

If your binary data is actually encoded text data, then you need to textEncode the 'needle' (the thing you are searching for) first, making sure you do so with the encoding which the encoded text data requires:


  -- put the encoded/raw data you want to search into tHaystackData
  put textEncode(tNeedleText, <encoding of target data>) into tNeedleData
  put byteOffset(tNeedleData, tHaystackData) into tOffset

However, it is important to note that this only allows an exact match - you can't do caseless searches like this (or searches where you want 'e-acute' to match both 'e-acute' and 'e,combining-acute').
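The exact-match steps above can be wrapped in a single function. This is only a sketch: the handler name encodedByteOffset and its parameter names are made up for illustration, and it assumes the haystack really is text encoded with the encoding you pass in.

```livecode
function encodedByteOffset pNeedleText, pHaystackData, pEncoding
   local tNeedleData
   -- Encode the needle with the same encoding as the data being searched.
   put textEncode(pNeedleText, pEncoding) into tNeedleData
   -- byteOffset gives the 1-based position of the first matching byte,
   -- or 0 if the byte sequence does not occur.
   return byteOffset(tNeedleData, pHaystackData)
end function
```

Because the comparison is done on bytes, a needle that differs from the data only in case or in normalization form will not be found.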

If you want to do caseless searches, then you need to do something like this:


   put textDecode(tHaystackData, <encoding of data>) into tHaystackText
   put offset(tNeedleText, tHaystackText) into tNeedleOffset
   put the number of bytes in textEncode(char 1 to tNeedleOffset of tHaystackText, <encoding of data>) into tNeedleByteOffset

i.e. The operation you are wanting to perform is 'offset of <needle> in <data> when using encoding <encoding>', which might make a useful engine addition - feel free to file an enhancement, although the above snippet should work in script with the operations we currently have. (Similarly, your 'byteLen' function is actually 'length of string in encoding <encoding>' - that also might be a useful engine addition, but can also be done in script now, as outlined above).
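A possible script-level version of that 'offset of <needle> in <data> when using encoding <encoding>' operation is sketched below. The handler name textByteOffset and its parameters are hypothetical. It also makes one choice that differs from the snippet above: it counts the bytes of the text *before* the match and adds one, so the result is the 1-based position of the first byte of the match even when the match starts with a multi-byte character. It assumes the data is nothing but encoded text, with no header in front of it.

```livecode
function textByteOffset pNeedleText, pHaystackData, pEncoding
   local tHaystackText, tCharOffset, tPrefixBytes
   put textDecode(pHaystackData, pEncoding) into tHaystackText
   -- offset() compares text, so it follows the caseSensitive property
   -- (false by default, i.e. a caseless search).
   put offset(pNeedleText, tHaystackText) into tCharOffset
   -- Not found: report 0, as offset and byteOffset do.
   if tCharOffset is 0 then return 0
   -- Match at the very start: nothing precedes it.
   if tCharOffset is 1 then return 1
   -- Re-encode the text preceding the match to learn how many bytes it occupies.
   put the number of bytes in textEncode(char 1 to (tCharOffset - 1) of tHaystackText, pEncoding) into tPrefixBytes
   return tPrefixBytes + 1
end function
```

Decoding the whole haystack and re-encoding the prefix costs more than a plain byteOffset, so this is best kept for cases where a caseless or normalization-insensitive match is really needed.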


Warmest Regards,

Mark.

--
Mark Waddingham ~ m...@livecode.com ~ http://www.livecode.com/
LiveCode: Everyone can create apps
