I thought of a quick way to do a first pass and it can almost fit in the margin.

> On Sep 19, 2019, at 10:25 AM, Dar Scott Consulting via use-livecode 
> <use-livecode@lists.runrev.com> wrote:
> 
> UTF-16 and UTF-32 are not needed in your list. Those are BE unless indicated 
> otherwise by a leading BOM. That is, the BE and LE versions are sufficient. 
> 
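The BOM point above can be sketched in a few lines. This is only an illustration in Python, not LiveCode Script, and the name `sniff_bom` is hypothetical. It checks the leading bytes against the standard BOMs; a file with no BOM returns None and has to go on to other tests.

```python
import codecs

# (BOM bytes, encoding label) pairs. UTF-32 is listed before UTF-16 because
# the UTF-32LE BOM (FF FE 00 00) begins with the UTF-16LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
]


def sniff_bom(data: bytes):
    """Return (encoding, bom_length) if data starts with a known BOM, else None."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None
```

One caveat: FF FE 00 00 could in principle be a UTF-16LE BOM followed by a NUL character, so the UTF-32LE answer is a "most likely" rather than a certainty.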
> ASCII encoding is a subset of CP1252, MacRoman and UTF-8, so that can be 
> classified as UTF-8 if there is no advantage to knowing that it is ASCII. 
> (Printable ASCII is a subset of ISO-8859-1). 
> 
> A couple of thoughts on creating a custom function. Your special codes in ASCII 
> files of 1, 2, 3 and 4 can be considered in a custom function. You might have 
> a good idea in just 128 bytes or maybe a few iterations of 32 bytes. You can 
> consider an a priori ordering of likelihood, related to the question of which 
> tests provide the most information in the least time. And if you can't tell 
> the difference, then maybe it doesn't matter. 
> 
> I considered some methods of adjusting probabilities but the overhead means 
> the test chunks should not be trivial. Also, the probability might be 
> simplified to "maybe" and "nope". (However, if there might be errors in the 
> text or discernment needs to rely on text probabilities, the numbers might be 
> best.)  Tests move probabilities from maybe to nope.
> 
> One method might do a batch of unsigned 32-bit int decodes and do logic 
> operations on each of those. That can only do partial elimination tests on 
> UTF-8, but detailed tests can be done afterward. I am not sure about 
> performance; it might be that byteToNum() would be much faster.
> 
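A rough sketch of the batch 32-bit idea, in Python rather than LiveCode Script. The function name `first_pass_words` and the "maybe"/"nope" dictionary are hypothetical, and this is only one way the logic operations might be arranged. Each flag starts as "maybe" (True) and a test moves it to "nope" (False). Only whole 4-byte words are examined, so up to three trailing bytes are ignored.

```python
import struct


def first_pass_words(chunk: bytes) -> dict:
    """Partial elimination tests using unsigned 32-bit decodes of a chunk."""
    n = len(chunk) // 4
    body = chunk[: n * 4]
    be = struct.unpack(f">{n}I", body)  # big-endian words
    le = struct.unpack(f"<{n}I", body)  # little-endian words

    def valid_scalar(w: int) -> bool:
        # A UTF-32 code unit must be a Unicode scalar value:
        # at most U+10FFFF and not in the surrogate range.
        return w <= 0x10FFFF and not (0xD800 <= w <= 0xDFFF)

    return {
        # Any byte with its high bit set rules out pure ASCII.
        "ASCII": not any(w & 0x80808080 for w in be),
        "UTF-32BE": n > 0 and all(valid_scalar(w) for w in be),
        "UTF-32LE": n > 0 and all(valid_scalar(w) for w in le),
    }
```

The high-bit mask says nothing about whether the non-ASCII bytes form valid UTF-8 sequences, which is why a detailed UTF-8 test still has to follow.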
> I'm guessing that one can get some good probabilities from the first four 
> bytes.
> 
> So, I agree with Curry. He might not use anything I mentioned, but he can 
> optimize your code for longer files, if you need full checking.
> 
>> On Sep 17, 2019, at 2:05 PM, Paul Dupuis via use-livecode 
>> <use-livecode@lists.runrev.com> wrote:
>> 
>> I started this post on the DEV-LIST. Mark Waddingham kindly responded and 
>> smartly suggested I should move it to the USE-LIST, so that is what I am 
>> doing. I have also pasted Mark's reply below my original post.
>> 
>> ---------------------- ORIGINAL POST ----------------------------------------
>> 
>> I have a LiveCode Script (LCS) routine that attempts to follow industry 
>> common algorithms for guessing the encoding of a text file.
>> 
>> Its performance can be slower than I would like.
>> 
>> This has led me to wonder if a LiveCode Builder (LCB) library may be the 
>> route to go. Does anyone know the OSX and/or Windows APIs for guessing a 
>> text file's encoding?
>> 
>> I have done a number of google searches, but I am not a C programmer (not in 
>> many decades) and wading through the huge doc sets at MSDN or Apple is 
>> daunting.
>> 
>> I found reference to a windows API:
>> 
>> BOOL IsTextUnicode( const VOID *lpv, int iSize, LPINT lpiResult );
>> 
>> Which suggests to me that such APIs may exist. Does anyone who is better at 
>> finding OS APIs know where to find such APIs? Can you point me to the right 
>> online documentation?
>> 
>> I also found this: 
>> https://stackoverflow.com/questions/3825390/effective-way-to-find-any-files-encoding
>> 
>> Of course, it would be wonderful if the mothership delivered this. At one 
>> point, back around LC7 or so, Frasier said he would.
>> 
>> https://quality.livecode.com/show_bug.cgi?id=14474
>> 
>> It seems an LCB library that uses OS APIs to return a best guess for file 
>> encoding, matching the encodings supported by the textEncode/textDecode 
>> functions, would be a great addition to LC:
>> 
>> * "ASCII"
>> * "UTF-16"
>> * "UTF-16BE"
>> * "UTF-16LE"
>> * "UTF-32"
>> * "UTF-32BE"
>> * "UTF-32LE"
>> * "UTF-8"
>> * "CP1252"
>> * "ISO-8859-1"
>> * "MacRoman"
>> 
>> and I suppose "Binary" as the default if none of the above can be detected
>> 
>> ----------------- MARK'S REPLY ----------------------------------------
>> On 2019-09-13 16:44, Paul Dupuis wrote:
>>> I have a LiveCode Script (LCS) routine that attempts to follow
>>> industry common algorithms for guessing the encoding of a text file.
>>> 
>>> Its performance can be slower than I would like.
>> 
>> If you share your code perhaps we can help speed it up...
>> 
>>> This has led me to wonder if a LiveCode Builder (LCB) library may be
>>> the route to go. Does anyone know the OSX and/or Windows APIs for
>>> guessing a text file's encoding?
>>> 
>>> I have done a number of google searches, but I am not a C programmer
>>> (not in many decades) and wading through the huge doc sets at MSDN or
>>> Apple is daunting.
>>> 
>>> I found reference to a windows API:
>>> 
>>> BOOL IsTextUnicode(
>>>  const VOID *lpv,
>>>  int        iSize,
>>>  LPINT      lpiResult
>>> );
>>> 
>>> Which suggests to me that such APIs may exist. Does anyone who is
>>> better at finding OS APIs know where to find such APIs? Can you point
>>> me to the right online documentation?
>> 
>> Libraries certainly exist: Mozilla has a 'universal charset detector 
>> library' for example, which appears to use various statistical heuristics to 
>> tell between all kinds of encodings.
>> 
>> The 'IsTextUnicode' API seems to just tell you whether a sequence of bytes 
>> is likely to be UTF-16 or not UTF-16; so probably won't be all that helpful 
>> if that isn't all you are wanting to distinguish between.
>> 
>> Do you have a list of encodings you are needing to guess between? That will 
>> generally influence how fast (and accurate) you can make such a function 
>> (it's almost trivial to detect UTF-8 with a high degree of confidence, UTF-32 
>> I think as well, UTF-16 is somewhat harder, and distinguishing between 
>> single-byte and legacy multi-byte charsets is, relatively speaking, very 
>> hard).
>> 
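To illustrate why UTF-8 is the easy case: a strict decode either succeeds or fails, and random single-byte text with high-bit characters very rarely happens to form valid UTF-8 sequences. A minimal sketch in Python (not LiveCode Script; the name `looks_like_utf8` and its `is_whole_file` parameter are hypothetical). The incremental decoder is used so that a test chunk cut off in the middle of a multi-byte sequence is not rejected.

```python
import codecs


def looks_like_utf8(chunk: bytes, is_whole_file: bool = False) -> bool:
    """Return False if chunk contains a byte sequence that is invalid UTF-8.

    With is_whole_file=False, an incomplete multi-byte sequence at the very
    end of the chunk is tolerated, since the chunk may have been cut mid-character.
    """
    decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")
    try:
        decoder.decode(chunk, final=is_whole_file)
    except UnicodeDecodeError:
        return False
    return True
```

Pure ASCII input also returns True here, which fits the suggestion elsewhere in this thread that ASCII can simply be classified as UTF-8.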
>> Warmest Regards,
>> 
>> Mark.
>> 
>> P.S. This might be a better discussion to have on the use-list unless there 
>> is a reason not to, it might be of interest to others in that wider group.
>> 
>> -- 
>> Mark Waddingham ~ m...@livecode.com ~ http://www.livecode.com/
>> LiveCode: Everyone can create apps
>> 
>> _______________________________________________
>> livecode-dev mailing list
>> livecode-...@lists.runrev.com
>> http://lists.runrev.com/mailman/listinfo/livecode-dev
>> 
>> _______________________________________________
>> use-livecode mailing list
>> use-livecode@lists.runrev.com
>> Please visit this url to subscribe, unsubscribe and manage your subscription 
>> preferences:
>> http://lists.runrev.com/mailman/listinfo/use-livecode
>> 