I thought of a quick way to do a first pass and it can almost fit in the margin.
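
For illustration only, and not necessarily the first pass I had in mind above, here is a rough Python sketch of the BOM check and the "maybe"/"nope" elimination discussed in the quoted thread below. The name first_pass_guess, the 128-byte sample size, and the shortened candidate list are assumptions made for this sketch. The NUL-byte test on 8-bit encodings is a heuristic, not a proof.

```python
import codecs
from typing import List

# Hypothetical, shortened candidate list based on the encodings named in the thread.
CANDIDATES = ["UTF-8", "UTF-16BE", "UTF-16LE", "UTF-32BE", "UTF-32LE", "CP1252"]

# UTF-32 BOMs are checked first because the UTF-32LE BOM starts with the UTF-16LE BOM.
BOMS = [
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
]


def first_pass_guess(data: bytes, sample_size: int = 128) -> List[str]:
    """Return the candidates still marked "maybe" after a few cheap tests."""
    # A leading BOM settles the question immediately.
    for bom, name in BOMS:
        if data.startswith(bom):
            return [name]

    sample = data[:sample_size]
    whole_file = len(data) <= sample_size
    maybe = set(CANDIDATES)

    # Strict UTF-8 decode of the sample. With final=False, a character cut off
    # by the sample boundary is not treated as an error.
    try:
        codecs.getincrementaldecoder("utf-8")().decode(sample, final=whole_file)
    except UnicodeDecodeError:
        maybe.discard("UTF-8")

    # UTF-16 data has an even byte count; UTF-32 data is a multiple of four.
    if len(data) % 2:
        maybe -= {"UTF-16BE", "UTF-16LE"}
    if len(data) % 4:
        maybe -= {"UTF-32BE", "UTF-32LE"}

    if 0 in sample:
        # Heuristic (assumption): NUL bytes are very rare in 8-bit text files.
        maybe -= {"UTF-8", "CP1252"}
    elif sample:
        # Every UTF-32 code unit has a zero high byte, so no NUL means no UTF-32.
        maybe -= {"UTF-32BE", "UTF-32LE"}

    # Report survivors in the a priori order of CANDIDATES.
    return [name for name in CANDIDATES if name in maybe]
```

On plain ASCII input such as b"hello" this leaves both UTF-8 and CP1252 as "maybe", which fits the point in the quoted message that if you can't tell the difference, it may not matter. Detailed checks on longer files would still have to come afterward.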
> On Sep 19, 2019, at 10:25 AM, Dar Scott Consulting via use-livecode
> <use-livecode@lists.runrev.com> wrote:
>
> UTF-16 and UTF-32 are not needed in your list. Those are BE unless indicated
> otherwise by a leading BOM. That is, the BE and LE versions are sufficient.
>
> ASCII encoding is a subset of CP1252, MacRoman and UTF-8, so that can be
> classified as UTF-8 if there is no advantage to knowing that it is ASCII.
> (Printable ASCII is a subset of ISO-8859-1).
>
> A couple thoughts in creating a custom function. Your special codes in ASCII
> files of 1, 2, 3 and 4 can be considered in a custom function. You might have
> a good idea in just 128 bytes or maybe a few iterations of 32 bytes. You can
> consider an a priori ordering of likelihood, related to the question of which
> tests provide the most information in the least time. And if you can't tell
> the difference, then maybe it doesn't matter.
>
> I considered some methods of adjusting probabilities but the overhead means
> the test chunks should not be trivial. Also, the probability might be
> simplified to "maybe" and "nope". (However, if there might be errors in the
> text or discernment needs to rely on text probabilities, the numbers might be
> best.) Tests move probabilities from maybe to nope.
>
> One method might do a batch of unsigned 32-bit int decodes and do logic
> operations on each of those. That can only do partial elimination tests on
> UTF-8, but detailed tests can be done afterward. I am not sure about
> performance, it might be that byteToNum() would be much faster.
>
> I'm guessing that one can get some good probabilities from the first four
> bytes.
>
> So, I agree with Curry. He might not use anything I mentioned, but he can
> optimize your code for longer files, if you need full checking.
>
>> On Sep 17, 2019, at 2:05 PM, Paul Dupuis via use-livecode
>> <use-livecode@lists.runrev.com> wrote:
>>
>> I started this post of the DEV-LIST.
>> Mark Waddingham kindly responded and
>> smartly suggested I should move it to the USE-LIST, so that is what I am
>> doing. I have also pasted Mark's reply below my original post.
>>
>> ---------------------- ORIGINAL POST ----------------------------------------
>>
>> I have a LiveCode Script (LCS) routine that attempts to follow industry
>> common algorithms for guessing the encoding of a text file.
>>
>> Its performance can be slower than I would like.
>>
>> This has led me to wonder if a LiveCode Builder (LCB) library may be the
>> route to go. Does anyone know the OSX and/or Windows APIs for guessing a
>> text file's encoding?
>>
>> I have done a number of google searches, but I am not a C programmer (not in
>> many decades) and wading through the huge doc sets at MSDN or Apple is
>> daunting.
>>
>> I found reference to a windows API:
>>
>> BOOL IsTextUnicode( const VOID *lpv, int iSize, LPINT lpiResult );
>>
>> Which suggests to me that such APIs may exist. Does anyone who is better at
>> finding OS APIs know where to find such APIs? Can you point me to the right
>> online documentation?
>>
>> I also found this:
>> https://stackoverflow.com/questions/3825390/effective-way-to-find-any-files-encoding
>>
>> Of course, it would be wonderful if the mothership delivered this. At one
>> point Frasier said he would back around LC7 something.
>>
>> https://quality.livecode.com/show_bug.cgi?id=14474
>>
>> It seems an LCB library that uses OS APIs to return best guess for file
>> encoding that match up with the textEncode/Decode functions would be a great
>> addition to LC
>>
>> * "ASCII"
>> * "UTF-16"
>> * "UTF-16BE"
>> * "UTF-16LE"
>> * "UTF-32"
>> * "UTF-32BE"
>> * "UTF-32LE"
>> * "UTF-8"
>> * "CP1252"
>> * "ISO-8859-1"
>> * "MacRoman"
>>
>> and I suppose "Binary" as the default if none of the above can be detected
>>
>> ----------------- MARK'S REPLY ----------------------------------------
>> On 2019-09-13 16:44, Paul Dupuis wrote:
>>> I have a LiveCode Script (LCS) routine that attempts to follow
>>> industry common algorithms for guessing the encoding of a text file.
>>>
>>> Its performance can be slower than I would like.
>>
>> If you share your code perhaps we can help speed it up...
>>
>>> This has led me to wonder if a LiveCode Builder (LCB) library may be
>>> the route to go. Does anyone know the OSX and/or Windows APIs for
>>> guessing a text file's encoding?
>>>
>>> I have done a number of google searches, but I am not a C programmer
>>> (not in many decades) and wading through the huge doc sets at MSDN or
>>> Apple is daunting.
>>>
>>> I found reference to a windows API:
>>>
>>> BOOL IsTextUnicode(
>>>   const VOID *lpv,
>>>   int iSize,
>>>   LPINT lpiResult
>>> );
>>>
>>> Which suggests to me that such APIs may exist. Does anyone who is
>>> better at finding OS APIs know where to find such APIs? Can you point
>>> me to the right online documentation?
>>
>> Libraries certainly exist: Mozilla has a 'universal charset detector
>> library' for example, which appears to use various statistical heuristics to
>> tell between all kinds of encodings.
>>
>> The 'IsTextUnicode' API seems to just tell you whether a sequence of bytes
>> is likely to be UTF-16 or not UTF-16; so probably won't be all that helpful
>> if that isn't all you are wanting to distinguish between.
>>
>> Do you have a list of encodings you are needing to guess between? That will
>> generally influence how fast (and accurate) you can make such a function
>> (it's almost trivial to detect UTF-8 with a high degree of confidence, UTF-32
>> I think as well, UTF-16 is somewhat harder, and distinguishing between
>> single-byte and legacy multi-byte charsets is, relatively speaking, very
>> hard).
>>
>> Warmest Regards,
>>
>> Mark.
>>
>> P.S. This might be a better discussion to have on the use-list unless there
>> is a reason not to, it might be of interest to others in that wider group.
>>
>> --
>> Mark Waddingham ~ m...@livecode.com ~ http://www.livecode.com/
>> LiveCode: Everyone can create apps
>>
>> _______________________________________________
>> livecode-dev mailing list
>> livecode-...@lists.runrev.com
>> http://lists.runrev.com/mailman/listinfo/livecode-dev
>>
>> _______________________________________________
>> use-livecode mailing list
>> use-livecode@lists.runrev.com
>> Please visit this url to subscribe, unsubscribe and manage your subscription
>> preferences:
>> http://lists.runrev.com/mailman/listinfo/use-livecode